# Deploying an LLM on GCP Cloud Run with GPU Support Using llama.cpp

In this post, I’ll walk you through deploying an LLM (specifically OpenAI’s gpt-oss-20b) on Google Cloud Run with GPU support using llama.cpp. We’ll also add nginx as a reverse proxy with basic authentication for added security.

## Why Cloud Run with GPUs?

Google Cloud Run now supports NVIDIA L4 GPUs, making it an excellent choice for deploying LLMs with:

- **Serverless scaling** - Scale to zero when not in use
- **Pay-per-use pricing** - Only pay for actual compute time
- **Simple deployment** - No infrastructure management
- **Fast cold starts** - GPU instances start in ~5 seconds

## Prerequisites

Before we begin, ensure you have: ...
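To make the GPU setup concrete, here is a minimal sketch of what the eventual deploy step can look like with the `gcloud` CLI. The service name (`llama-server`), image path, region, and resource sizes are placeholder assumptions for illustration, not values from this post; the GPU-specific parts are the `--gpu` and `--gpu-type` flags, which on Cloud Run currently require at least 4 CPUs and 16 GiB of memory.

```shell
# Sketch only: service name, image path, and sizes are assumptions.
# Cloud Run GPU services need >= 4 CPUs and >= 16Gi memory per instance.
gcloud run deploy llama-server \
  --image "us-central1-docker.pkg.dev/$PROJECT_ID/llm/llama-server:latest" \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --cpu 4 \
  --memory 16Gi \
  --max-instances 1
```

With `--max-instances 1` you cap spend while experimenting; scale-to-zero still applies, so you only pay while requests are being served.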