In this post, I’ll walk you through deploying an LLM (specifically OpenAI’s gpt-oss-20b) on Google Cloud Run with GPU support using llama.cpp. We’ll also add nginx as a reverse proxy with basic authentication for added security.

Why Cloud Run with GPUs?

Google Cloud Run now supports NVIDIA L4 GPUs, making it an excellent choice for deploying LLMs with:

  • Serverless scaling - Scale to zero when not in use
  • Pay-per-use pricing - Only pay for actual compute time
  • Simple deployment - No infrastructure management
  • Fast cold starts - GPU instances start in about 5 seconds (loading the model into VRAM adds more time)

Prerequisites

Before we begin, ensure you have:

  1. A Google Cloud Platform account with billing enabled
  2. The gcloud CLI installed and configured
  3. Docker installed locally (for testing)
  4. Required IAM roles:
    • Artifact Registry Admin
    • Cloud Build Editor
    • Cloud Run Admin
    • Service Account User
    • Storage Admin

Request GPU Quota

First, request GPU quota for Cloud Run:

# Visit the quota page and request "Total Nvidia L4 GPU allocation, per project per region"
# https://g.co/cloudrun/gpu-quota

Enable Required APIs

gcloud services enable \
    artifactregistry.googleapis.com \
    cloudbuild.googleapis.com \
    run.googleapis.com \
    storage.googleapis.com

Configure gcloud

gcloud config set project YOUR_PROJECT_ID
gcloud config set run/region europe-west1  # or us-central1

Architecture Overview

Our deployment uses the following architecture:

Client Request
      ↓
   nginx (port 8080)
   [Basic Auth]
      ↓
llama-server (port 8081)
   [GPU Inference]

Nginx handles authentication and proxies requests to the llama.cpp server running in the same container.

Project Structure

Create a new directory for your deployment:

mkdir llama-cloud-run && cd llama-cloud-run

Your final directory structure will look like this:

llama-cloud-run/
├── Dockerfile
├── nginx.conf
├── .htpasswd
└── start.sh

Step 1: Create the Nginx Configuration

Create nginx.conf to configure nginx as a reverse proxy with basic authentication:

events {
    worker_connections 1024;
}

http {
    server {
        listen 8080;
        
        # Basic authentication
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;
        
        # Health check endpoint (no auth required for Cloud Run health checks)
        location /health {
            auth_basic off;
            proxy_pass http://127.0.0.1:8081/health;
        }
        
        # Proxy all other requests to llama-server
        location / {
            proxy_pass http://127.0.0.1:8081;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Connection "";
            
            # Disable buffering for streaming responses
            proxy_buffering off;
            
            # Long timeout for LLM inference
            proxy_read_timeout 600s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 600s;
        }
    }
}

Key configuration points:

  • Port 8080: Cloud Run’s default expected port
  • Health endpoint without auth: Cloud Run needs unauthenticated access to /health for health checks
  • Disabled buffering: Essential for streaming LLM responses
  • Long timeouts: LLM inference can take time, especially for long outputs

Step 2: Create Basic Authentication Credentials

Generate the .htpasswd file for basic authentication:

# Install apache2-utils if not already installed
sudo apt-get install apache2-utils

# Create the password file (you'll be prompted for a password)
htpasswd -c .htpasswd llm_user

Security Note: For production deployments, consider using Cloud Run’s Secret Manager integration instead of baking credentials into the image.
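If apache2-utils isn't available, a few lines of Python can generate an entry that nginx accepts. This sketch uses the {SHA} scheme, which nginx's auth_basic_user_file supports; like the apr1 default it is a legacy hash, so treat it as a convenience for demos rather than a hardening measure. The user name and password below are placeholders.

```python
import base64
import hashlib

def htpasswd_entry(user: str, password: str) -> str:
    """Return a user:{SHA}... line in the format nginx's auth_basic_user_file accepts."""
    digest = base64.b64encode(hashlib.sha1(password.encode()).digest()).decode()
    return f"{user}:{{SHA}}{digest}"

# Write the file that nginx.conf points at
with open(".htpasswd", "w") as f:
    f.write(htpasswd_entry("llm_user", "change-me") + "\n")
```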

Step 3: Create the Startup Script

Create start.sh to orchestrate both services:

#!/bin/bash
set -e

echo "Starting llama-server..."

# Start llama-server in background on port 8081
/app/llama-server \
    --host 0.0.0.0 \
    --port 8081 \
    -m /models/gpt-oss-20b-MXFP4.gguf \
    -c 0 \
    -ngl 999 &

LLAMA_PID=$!

echo "Waiting for llama-server to be ready..."

# Wait for llama-server to be ready (with timeout)
TIMEOUT=120
ELAPSED=0
until curl -s http://127.0.0.1:8081/health > /dev/null 2>&1; do
    if [ $ELAPSED -ge $TIMEOUT ]; then
        echo "Timeout waiting for llama-server"
        exit 1
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "Waiting... ($ELAPSED seconds)"
done

echo "llama-server is ready!"
echo "Starting nginx..."

# Start nginx in foreground
exec nginx -g 'daemon off;'

Make it executable:

chmod +x start.sh

Key llama-server parameters:

  • -c 0: Load context size from model (uses model’s maximum)
  • -ngl 999: Offload all layers to GPU

Step 4: Create the Dockerfile

Create the Dockerfile:

# Use llama.cpp's official CUDA server image
FROM ghcr.io/ggml-org/llama.cpp:server-cuda

# Install nginx and required tools
RUN apt-get update && apt-get install -y \
    nginx \
    curl \
    python3 \
    python3-pip \
    && pip install 'huggingface-hub[cli]' \
    && rm -rf /var/lib/apt/lists/*

# Download the gpt-oss-20b model from Hugging Face
# Using MXFP4 quantization (12.1 GB) - fits in L4's 24GB VRAM
RUN python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='lmstudio-community/gpt-oss-20b-GGUF', filename='gpt-oss-20b-MXFP4.gguf', local_dir='/models')"

# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf

# Copy basic auth credentials
COPY .htpasswd /etc/nginx/.htpasswd
RUN chmod 644 /etc/nginx/.htpasswd

# Copy startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh

# Expose port 8080 (Cloud Run default)
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# Start both services
ENTRYPOINT ["/start.sh"]

Important Notes:

  1. Do not use --break-system-packages with pip - the base image uses an older pip version that doesn’t support this flag.

  2. Use Python’s hf_hub_download function instead of huggingface-cli command - the CLI script may not be in PATH after pip install in the container.

  3. The gpt-oss-20b GGUF model is available from lmstudio-community/gpt-oss-20b-GGUF (not directly from OpenAI). The MXFP4 quantization is 12.1 GB and fits comfortably in the L4’s 24GB VRAM.

Step 5: Build and Test Locally

Before deploying to Cloud Run, test locally:

# Build the image
docker build -t llama-gpt-oss .

# Run with GPU support
docker run --gpus all -p 8080:8080 llama-gpt-oss

# Test in another terminal
curl -u llm_user:your_password http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-oss-20b",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 256
    }'
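The same request can also be issued from a script using only Python's standard library. This is a minimal sketch mirroring the curl example above; the URL, user name, and password are placeholders you would replace with your own values.

```python
import base64
import json
import urllib.request

def chat_request(url: str, user: str, password: str, prompt: str,
                 max_tokens: int = 256) -> urllib.request.Request:
    """Build a basic-auth request against the OpenAI-compatible chat endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        url + "/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )

# With the container running locally:
# req = chat_request("http://localhost:8080", "llm_user", "your_password", "Hello")
# print(urllib.request.urlopen(req).read().decode())
```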

Step 6: Deploy to Cloud Run

Deploy your service to Cloud Run with GPU support:

gcloud run deploy llama-gpt-oss \
    --source . \
    --port 8080 \
    --concurrency 4 \
    --cpu 8 \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --max-instances 1 \
    --memory 32Gi \
    --allow-unauthenticated \
    --no-cpu-throttling \
    --no-gpu-zonal-redundancy \
    --timeout=600

Configuration explained:

  • --gpu 1 - Attach one NVIDIA L4 GPU
  • --gpu-type nvidia-l4 - Specify the L4 GPU type (24GB VRAM)
  • --cpu 8 - The required minimum for GPU instances
  • --memory 32Gi - Sufficient headroom for the model plus runtime overhead
  • --concurrency 4 - Parallel requests per instance
  • --max-instances 1 - Limit based on your GPU quota
  • --no-cpu-throttling - Required for GPU workloads; CPU stays allocated outside requests
  • --no-gpu-zonal-redundancy - Opt out of zonal redundancy for lower GPU pricing
  • --timeout 600 - Allow long inference requests (up to 10 minutes)

Note: We use --allow-unauthenticated because we handle authentication at the nginx layer. Alternatively, you could use Cloud Run’s IAM authentication and remove nginx basic auth.

Step 7: Test the Deployment

Once deployed, test your endpoint:

# Get the service URL
SERVICE_URL=$(gcloud run services describe llama-gpt-oss --format='value(status.url)')

# Send a test request
curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-oss-20b",
        "messages": [
            {"role": "user", "content": "Write a haiku about cloud computing."}
        ],
        "max_tokens": 500
    }'

Note: gpt-oss-20b is a reasoning model that shows its thinking process. Use higher max_tokens values (500+) to get complete responses including the reasoning chain.

Testing with Streaming

llama.cpp supports streaming responses:

curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-oss-20b",
        "messages": [
            {"role": "user", "content": "Tell me a short story."}
        ],
        "stream": true
    }'
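With "stream": true, the response arrives as server-sent events, one data: line per token chunk, terminated by data: [DONE]. A small parser sketch for extracting the text deltas, following the OpenAI-style streaming format that llama.cpp emits:

```python
import json

def iter_stream_content(lines):
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.strip()
        # Skip keep-alives, non-data lines, and the end-of-stream marker
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]
```

In practice you would feed this the response body line by line (e.g. iterating over the HTTP response in a streaming client) and print each delta as it arrives.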

Alternative: Deploy Mistral-7B-Instruct

If you prefer a smaller, faster model, you can deploy Mistral-7B-Instruct instead. This model is only ~4.4 GB (Q4_K_M quantization) and offers excellent performance for general-purpose chat tasks.

Dockerfile for Mistral-7B

Create a Dockerfile with the following content:

# Use llama.cpp's official CUDA server image
FROM ghcr.io/ggml-org/llama.cpp:server-cuda

# Install nginx and required tools
RUN apt-get update && apt-get install -y \
    nginx \
    curl \
    python3 \
    python3-pip \
    && pip install 'huggingface-hub[cli]' \
    && rm -rf /var/lib/apt/lists/*

# Download the Mistral-7B-Instruct model from Hugging Face
# Using Q4_K_M quantization for good balance of quality and memory
RUN python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='TheBloke/Mistral-7B-Instruct-v0.2-GGUF', filename='mistral-7b-instruct-v0.2.Q4_K_M.gguf', local_dir='/models')"

# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf

# Copy basic auth credentials
COPY .htpasswd /etc/nginx/.htpasswd
RUN chmod 644 /etc/nginx/.htpasswd

# Copy startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh

# Expose port 8080 (Cloud Run default)
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# Start both services
ENTRYPOINT ["/start.sh"]

start.sh for Mistral-7B

Update your start.sh to use the Mistral model:

#!/bin/bash
set -e

echo "Starting llama-server..."

# Start llama-server in background on port 8081
/app/llama-server \
    --host 0.0.0.0 \
    --port 8081 \
    -m /models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -c 0 \
    -ngl 999 &

LLAMA_PID=$!

echo "Waiting for llama-server to be ready..."

# Wait for llama-server to be ready (with timeout)
TIMEOUT=120
ELAPSED=0
until curl -s http://127.0.0.1:8081/health > /dev/null 2>&1; do
    if [ $ELAPSED -ge $TIMEOUT ]; then
        echo "Timeout waiting for llama-server"
        exit 1
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "Waiting... ($ELAPSED seconds)"
done

echo "llama-server is ready!"
echo "Starting nginx..."

# Start nginx in foreground
exec nginx -g 'daemon off;'

Advantages of Mistral-7B

How the two models compare (gpt-oss-20b first, Mistral-7B-Instruct second):

  • Model size: 12.1 GB (MXFP4) vs. 4.4 GB (Q4_K_M)
  • Cold start: ~30-60 seconds vs. ~15-30 seconds
  • Inference speed: moderate vs. fast
  • Reasoning: advanced (shows chain-of-thought) vs. standard
  • Best for: complex reasoning tasks vs. general chat and faster responses

Test Mistral Deployment

curl -u llm_user:your_password "${SERVICE_URL}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistral-7b",
        "messages": [
            {"role": "user", "content": "Write a haiku about cloud computing."}
        ],
        "max_tokens": 100
    }'

API Endpoints

The llama.cpp server provides OpenAI-compatible endpoints:

  • POST /v1/chat/completions - Chat completions (OpenAI compatible)
  • POST /v1/completions - Text completions
  • POST /v1/embeddings - Text embeddings
  • GET /health - Health check
  • GET /v1/models - List available models

Cost Optimization Tips

  1. Scale to zero: Cloud Run automatically scales to zero when idle
  2. Use appropriate quantization: Q4_K_M provides good quality at lower memory
  3. Set appropriate timeouts: Avoid paying for hung requests
  4. Monitor usage: Use Cloud Monitoring to track GPU utilization

Troubleshooting

Container fails to start

Check Cloud Run logs:

gcloud run services logs read llama-gpt-oss --limit=50

Out of memory errors

  • Ensure you’re using a quantized model (Q4_K_M or smaller)
  • Reduce context size: change -c 0 to -c 8192 in start.sh

Slow cold starts

  • The model is embedded in the image for faster starts
  • First request after scale-to-zero takes ~30-60 seconds
  • Consider using minimum instances (--min-instances 1) for production

Authentication issues

  • Verify .htpasswd file is correctly generated
  • Check nginx logs in Cloud Run console
  • Ensure health endpoint is excluded from auth

Model file not found

If you see errors like failed to open GGUF file ... (No such file or directory):

  • Verify the model filename matches between Dockerfile and start.sh
  • Use hf_hub_download Python function instead of huggingface-cli command
  • Check that the HuggingFace repo and filename are correct
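One quick sanity check: every GGUF file begins with the four ASCII bytes GGUF, so a wrong or truncated download is easy to spot without loading the model. A small sketch:

```python
def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example (inside the container):
# looks_like_gguf("/models/gpt-oss-20b-MXFP4.gguf")
```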

Cleanup

To avoid ongoing charges, delete the resources when done:

# Delete the Cloud Run service
gcloud run services delete llama-gpt-oss

# Delete container images from Artifact Registry
gcloud artifacts docker images delete \
    REGION-docker.pkg.dev/PROJECT_ID/cloud-run-source-deploy/llama-gpt-oss

Conclusion

You now have a fully functional LLM deployment on GCP Cloud Run with:

  • GPU-accelerated inference using NVIDIA L4
  • OpenAI-compatible API endpoints
  • Basic authentication via nginx
  • Automatic scaling (including scale-to-zero)
  • Health checks for reliability

This setup provides a cost-effective way to run your own LLM inference endpoint with the flexibility of serverless infrastructure.
