In this post, I’ll walk you through deploying an LLM (specifically OpenAI’s gpt-oss-20b) on Google Cloud Run with GPU support using llama.cpp. We’ll also add nginx as a reverse proxy with basic authentication for added security.
Why Cloud Run with GPUs?
Google Cloud Run now supports NVIDIA L4 GPUs, making it an excellent choice for deploying LLMs with:
- Serverless scaling - Scale to zero when not in use
- Pay-per-use pricing - Only pay for actual compute time
- Simple deployment - No infrastructure management
- Fast cold starts - GPU instances start in ~5 seconds
Prerequisites
Before we begin, ensure you have:
- A Google Cloud Platform account with billing enabled
- The `gcloud` CLI installed and configured
- Docker installed locally (for testing)
- Required IAM roles:
- Artifact Registry Admin
- Cloud Build Editor
- Cloud Run Admin
- Service Account User
- Storage Admin
Request GPU Quota
First, request GPU quota for Cloud Run:
# Visit the quota page and request "Total Nvidia L4 GPU allocation, per project per region"
# https://g.co/cloudrun/gpu-quota
Enable Required APIs
gcloud services enable \
artifactregistry.googleapis.com \
cloudbuild.googleapis.com \
run.googleapis.com \
storage.googleapis.com
Configure gcloud
gcloud config set project YOUR_PROJECT_ID
gcloud config set run/region europe-west1 # or us-central1
Architecture Overview
Our deployment uses the following architecture:
Client Request
↓
nginx (port 8080)
[Basic Auth]
↓
llama-server (port 8081)
[GPU Inference]
Nginx handles authentication and proxies requests to the llama.cpp server running in the same container.
Project Structure
Create a new directory for your deployment:
mkdir llama-cloud-run && cd llama-cloud-run
Your final directory structure will look like this:
llama-cloud-run/
├── Dockerfile
├── nginx.conf
├── .htpasswd
└── start.sh
Step 1: Create the Nginx Configuration
Create nginx.conf to configure nginx as a reverse proxy with basic authentication:
events {
    worker_connections 1024;
}

http {
    server {
        listen 8080;

        # Basic authentication
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Health check endpoint (no auth required for Cloud Run health checks)
        location /health {
            auth_basic off;
            proxy_pass http://127.0.0.1:8081/health;
        }

        # Proxy all other requests to llama-server
        location / {
            proxy_pass http://127.0.0.1:8081;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Connection "";

            # Disable buffering for streaming responses
            proxy_buffering off;

            # Long timeouts for LLM inference
            proxy_read_timeout 600s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 600s;
        }
    }
}
Key configuration points:
- Port 8080: Cloud Run’s default expected port
- Health endpoint without auth: Cloud Run needs unauthenticated access to `/health` for health checks
- Disabled buffering: Essential for streaming LLM responses
- Long timeouts: LLM inference can take time, especially for long outputs
Step 2: Create Basic Authentication Credentials
Generate the .htpasswd file for basic authentication:
# Install apache2-utils if not already installed
sudo apt-get install apache2-utils
# Create the password file (you'll be prompted for a password)
htpasswd -c .htpasswd llm_user
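If `apache2-utils` isn't available on your machine, `openssl` can produce a compatible entry. A minimal sketch (the username and password are placeholders; `-apr1` produces the Apache MD5 hash that nginx's `auth_basic` understands):

```shell
# Generate an .htpasswd entry without apache2-utils.
# "llm_user" and "your_password" are placeholders -- substitute your own.
printf 'llm_user:%s\n' "$(openssl passwd -apr1 'your_password')" > .htpasswd
```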
Security Note: For production deployments, consider using Cloud Run’s Secret Manager integration instead of baking credentials into the image.
Step 3: Create the Startup Script
Create start.sh to orchestrate both services:
#!/bin/bash
set -e
echo "Starting llama-server..."
# Start llama-server in background on port 8081
/app/llama-server \
    --host 0.0.0.0 \
    --port 8081 \
    -m /models/gpt-oss-20b-MXFP4.gguf \
    -c 0 \
    -ngl 999 &
LLAMA_PID=$!
echo "Waiting for llama-server to be ready..."
# Wait for llama-server to be ready (with timeout)
TIMEOUT=120
ELAPSED=0
until curl -s http://127.0.0.1:8081/health > /dev/null 2>&1; do
    if [ $ELAPSED -ge $TIMEOUT ]; then
        echo "Timeout waiting for llama-server"
        exit 1
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "Waiting... ($ELAPSED seconds)"
done
echo "llama-server is ready!"
echo "Starting nginx..."
# Start nginx in foreground
exec nginx -g 'daemon off;'
Make it executable:
chmod +x start.sh
Key llama-server parameters:
- `-c 0`: Load context size from model (uses the model's maximum)
- `-ngl 999`: Offload all layers to GPU
Step 4: Create the Dockerfile
Create the Dockerfile:
# Use llama.cpp's official CUDA server image
FROM ghcr.io/ggml-org/llama.cpp:server-cuda
# Install nginx and required tools
RUN apt-get update && apt-get install -y \
nginx \
curl \
python3 \
python3-pip \
&& pip install huggingface-hub[cli] \
&& rm -rf /var/lib/apt/lists/*
# Download the gpt-oss-20b model from Hugging Face
# Using MXFP4 quantization (12.1 GB) - fits in L4's 24GB VRAM
RUN python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='lmstudio-community/gpt-oss-20b-GGUF', filename='gpt-oss-20b-MXFP4.gguf', local_dir='/models')"
# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf
# Copy basic auth credentials
COPY .htpasswd /etc/nginx/.htpasswd
RUN chmod 644 /etc/nginx/.htpasswd
# Copy startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh
# Expose port 8080 (Cloud Run default)
EXPOSE 8080
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
# Start both services
ENTRYPOINT ["/start.sh"]
Important Notes:
- Do not use `--break-system-packages` with pip: the base image uses an older pip version that doesn't support this flag.
- Use Python's `hf_hub_download` function instead of the `huggingface-cli` command: the CLI script may not be in PATH after pip install in the container.
- The gpt-oss-20b GGUF model is available from `lmstudio-community/gpt-oss-20b-GGUF` (not directly from OpenAI). The MXFP4 quantization is 12.1 GB and fits comfortably in the L4's 24GB VRAM.
Step 5: Build and Test Locally
Before deploying to Cloud Run, test locally:
# Build the image
docker build -t llama-gpt-oss .
# Run with GPU support
docker run --gpus all -p 8080:8080 llama-gpt-oss
# Test in another terminal
curl -u llm_user:your_password http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-20b",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"max_tokens": 256
}'
Step 6: Deploy to Cloud Run
Deploy your service to Cloud Run with GPU support:
gcloud run deploy llama-gpt-oss \
--source . \
--port 8080 \
--concurrency 4 \
--cpu 8 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--allow-unauthenticated \
--no-cpu-throttling \
--no-gpu-zonal-redundancy \
--timeout=600
Configuration explained:
| Flag | Value | Purpose |
|---|---|---|
| `--gpu` | 1 | Attach one NVIDIA L4 GPU |
| `--gpu-type` | nvidia-l4 | Specify L4 GPU type (24GB VRAM) |
| `--cpu` | 8 | Required minimum for GPU instances |
| `--memory` | 32Gi | Sufficient for model + overhead |
| `--concurrency` | 4 | Parallel requests per instance |
| `--max-instances` | 1 | Limit based on GPU quota |
| `--no-cpu-throttling` | - | Required for GPU workloads |
| `--timeout` | 600 (10 min) | Allow long inference requests |
Note: We use --allow-unauthenticated because we handle authentication at the nginx layer. Alternatively, you could use Cloud Run’s IAM authentication and remove nginx basic auth.
Step 7: Test the Deployment
Once deployed, test your endpoint:
# Get the service URL
SERVICE_URL=$(gcloud run services describe llama-gpt-oss --format='value(status.url)')
# Send a test request
curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-20b",
"messages": [
{"role": "user", "content": "Write a haiku about cloud computing."}
],
"max_tokens": 500
}'
Note: gpt-oss-20b is a reasoning model that shows its thinking process. Use higher max_tokens values (500+) to get complete responses including the reasoning chain.
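If you'd rather call the endpoint from code than from curl, here is a minimal stdlib-only Python sketch. The service URL and credentials are placeholders for the values from the steps above:

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Basic auth header that nginx's auth_basic checks."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def chat(service_url: str, user: str, password: str, prompt: str,
         max_tokens: int = 500) -> str:
    """POST one chat request to the deployed endpoint and return the reply text."""
    body = json.dumps({
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{service_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 **basic_auth_header(user, password)},
    )
    # 600 s matches the nginx and Cloud Run timeouts configured earlier
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# Usage (placeholders -- substitute your real service URL and password):
# print(chat("https://llama-gpt-oss-xxxxx.a.run.app",
#            "llm_user", "your_password",
#            "Write a haiku about cloud computing."))
```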
Testing with Streaming
llama.cpp supports streaming responses:
curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-20b",
"messages": [
{"role": "user", "content": "Tell me a short story."}
],
"stream": true
}'
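When consuming the stream programmatically, each `data:` line carries a JSON chunk whose `delta` holds the next content fragment, and the stream ends with `data: [DONE]`. A minimal parser sketch for that line format:

```python
import json


def iter_deltas(lines):
    """Yield content fragments from an OpenAI-style SSE byte stream.

    `lines` is any iterable of raw byte lines, e.g. a streaming HTTP
    response body split on newlines.
    """
    for raw in lines:
        if not raw.startswith(b"data: "):
            continue  # skip blank keep-alive lines
        payload = raw[len(b"data: "):]
        if payload.strip() == b"[DONE]":
            return
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]


# Usage: print fragments as they arrive for a typewriter effect.
# for piece in iter_deltas(response_lines):
#     print(piece, end="", flush=True)
```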
Alternative: Deploy Mistral-7B-Instruct
If you prefer a smaller, faster model, you can deploy Mistral-7B-Instruct instead. This model is only ~4.4 GB (Q4_K_M quantization) and offers excellent performance for general-purpose chat tasks.
Dockerfile for Mistral-7B
Create a Dockerfile with the following content:
# Use llama.cpp's official CUDA server image
FROM ghcr.io/ggml-org/llama.cpp:server-cuda
# Install nginx and required tools
RUN apt-get update && apt-get install -y \
nginx \
curl \
python3 \
python3-pip \
&& pip install huggingface-hub[cli] \
&& rm -rf /var/lib/apt/lists/*
# Download the Mistral-7B-Instruct model from Hugging Face
# Using Q4_K_M quantization for good balance of quality and memory
RUN python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='TheBloke/Mistral-7B-Instruct-v0.2-GGUF', filename='mistral-7b-instruct-v0.2.Q4_K_M.gguf', local_dir='/models')"
# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf
# Copy basic auth credentials
COPY .htpasswd /etc/nginx/.htpasswd
RUN chmod 644 /etc/nginx/.htpasswd
# Copy startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh
# Expose port 8080 (Cloud Run default)
EXPOSE 8080
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
# Start both services
ENTRYPOINT ["/start.sh"]
start.sh for Mistral-7B
Update your start.sh to use the Mistral model:
#!/bin/bash
set -e
echo "Starting llama-server..."
# Start llama-server in background on port 8081
/app/llama-server \
    --host 0.0.0.0 \
    --port 8081 \
    -m /models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -c 0 \
    -ngl 999 &
LLAMA_PID=$!
echo "Waiting for llama-server to be ready..."
# Wait for llama-server to be ready (with timeout)
TIMEOUT=120
ELAPSED=0
until curl -s http://127.0.0.1:8081/health > /dev/null 2>&1; do
    if [ $ELAPSED -ge $TIMEOUT ]; then
        echo "Timeout waiting for llama-server"
        exit 1
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "Waiting... ($ELAPSED seconds)"
done
echo "llama-server is ready!"
echo "Starting nginx..."
# Start nginx in foreground
exec nginx -g 'daemon off;'
Advantages of Mistral-7B
| Aspect | gpt-oss-20b | Mistral-7B-Instruct |
|---|---|---|
| Model Size | 12.1 GB (MXFP4) | 4.4 GB (Q4_K_M) |
| Cold Start | ~30-60 seconds | ~15-30 seconds |
| Inference Speed | Moderate | Fast |
| Reasoning | Advanced (shows chain-of-thought) | Standard |
| Best For | Complex reasoning tasks | General chat, faster responses |
Test Mistral Deployment
curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b",
"messages": [
{"role": "user", "content": "Write a haiku about cloud computing."}
],
"max_tokens": 100
}'
API Endpoints
The llama.cpp server provides OpenAI-compatible endpoints:
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (OpenAI compatible) |
| `POST /v1/completions` | Text completions |
| `POST /v1/embeddings` | Text embeddings |
| `GET /health` | Health check |
| `GET /v1/models` | List available models |
Cost Optimization Tips
- Scale to zero: Cloud Run automatically scales to zero when idle
- Use appropriate quantization: Q4_K_M provides good quality at lower memory
- Set appropriate timeouts: Avoid paying for hung requests
- Monitor usage: Use Cloud Monitoring to track GPU utilization
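With scale-to-zero you pay roughly for the seconds an instance is live. A back-of-envelope sketch; the per-second rate below is a placeholder, not current GCP pricing, so substitute the actual L4 + CPU + memory rates for your region:

```python
# PLACEHOLDER rate -- look up real Cloud Run GPU pricing for your region.
RATE_PER_SECOND = 0.0007  # hypothetical combined $/s while an instance is live


def monthly_cost(requests_per_day: int, avg_seconds_per_request: float,
                 rate: float = RATE_PER_SECOND) -> float:
    """Estimate monthly cost assuming billing only while serving requests."""
    billable_seconds = requests_per_day * avg_seconds_per_request * 30
    return billable_seconds * rate


# e.g. 200 requests/day at 8 s each:
# monthly_cost(200, 8) -> 200 * 8 * 30 * 0.0007 = 33.6 (dollars)
```

Cold-start seconds and any `--min-instances` floor add to this, so the estimate is a lower bound.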
Troubleshooting
Container fails to start
Check Cloud Run logs:
gcloud run services logs read llama-gpt-oss --limit=50
Out of memory errors
- Ensure you’re using a quantized model (Q4_K_M or smaller)
- Reduce context size: change `-c 0` to `-c 8192` in start.sh
Slow cold starts
- The model is embedded in the image for faster starts
- First request after scale-to-zero takes ~30-60 seconds
- Consider using minimum instances (`--min-instances 1`) for production
Authentication issues
- Verify the `.htpasswd` file is correctly generated
- Check nginx logs in the Cloud Run console
- Ensure the health endpoint is excluded from auth
Model file not found
If you see errors like `failed to open GGUF file ... (No such file or directory)`:
- Verify the model filename matches between Dockerfile and start.sh
- Use the `hf_hub_download` Python function instead of the `huggingface-cli` command
- Check that the Hugging Face repo and filename are correct
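A quick way to confirm the download produced a real model file: GGUF files begin with the 4-byte magic `GGUF`, so a truncated or missing file is easy to detect. A small sketch you could run inside the container (the path is the one used in this post's Dockerfile):

```python
def looks_like_gguf(path: str) -> bool:
    """Sanity check: GGUF files begin with the 4-byte magic b'GGUF'."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"GGUF"
    except OSError:
        return False  # missing or unreadable file


# Inside the container, e.g.:
# print(looks_like_gguf("/models/gpt-oss-20b-MXFP4.gguf"))
```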
Cleanup
To avoid ongoing charges, delete the resources when done:
# Delete the Cloud Run service
gcloud run services delete llama-gpt-oss
# Delete container images from Artifact Registry
gcloud artifacts docker images delete \
REGION-docker.pkg.dev/PROJECT_ID/cloud-run-source-deploy/llama-gpt-oss
Conclusion
You now have a fully functional LLM deployment on GCP Cloud Run with:
- GPU-accelerated inference using NVIDIA L4
- OpenAI-compatible API endpoints
- Basic authentication via nginx
- Automatic scaling (including scale-to-zero)
- Health checks for reliability
This setup provides a cost-effective way to run your own LLM inference endpoint with the flexibility of serverless infrastructure.