FreeInference Deployment

Cloudflare + Nginx + FastAPI (current)

Traffic flows through three layers before reaching the application:

Client ──▶ Cloudflare ──▶ Nginx (:443) ──▶ FastAPI  (:8080)
                                      ├──▶ pgAdmin  (:5050)   [admin-only]
                                      └──▶ Frontend (:3001)

  • Cloudflare: CDN, DDoS protection, and edge SSL termination. The SSL/TLS mode is set to Full (strict), so Cloudflare verifies the origin certificate. The CF-Connecting-IP header carries the real client IP (see the real-IP sketch after this list).

  • Nginx: TLS termination (Let’s Encrypt cert), path-based routing (see below), per-location body size limits (/v1/ is bumped to 50 MB to accommodate large completion payloads and Qdrant vector upserts via the /v1/qdrant proxy; everything else uses the Nginx 1 MB default), and WebSocket upgrade.

  • FastAPI: API logic (request authentication, model routing, backpressure, Qdrant proxy, and observability). Listens on 127.0.0.1:8080.
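
A minimal sketch of the matching Nginx directives for the Cloudflare hop, assuming the stock ngx_http_realip_module and standard Let’s Encrypt paths; the Cloudflare ranges shown are examples, not the full published list:

# Restore the real client IP that Cloudflare passes along
set_real_ip_from 173.245.48.0/20;     # example range; use Cloudflare's published IP lists
set_real_ip_from 2400:cb00::/32;
real_ip_header  CF-Connecting-IP;

# Origin certificate that Cloudflare validates in Full (strict) mode
ssl_certificate     /etc/letsencrypt/live/freeinference.org/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/freeinference.org/privkey.pem;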

Nginx path routing (a location-block sketch follows the list):

  • /v1/, /auth/, /user/, /admin/, /internal/playground/ → FastAPI

  • /pgadmin/ → pgAdmin — gated by auth_request against FastAPI’s /internal/verify-admin endpoint, so only admins reach it

  • everything else → frontend
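
A simplified view of how that routing translates into location blocks; the directive values are illustrative rather than the production file, and pgAdmin header tweaks are omitted:

# API and auth paths go to FastAPI (same pattern for /auth/, /user/, /admin/, /internal/playground/)
location /v1/ {
    client_max_body_size 50M;              # large completion payloads and Qdrant upserts
    proxy_pass http://127.0.0.1:8080;
}

# pgAdmin is gated by an auth subrequest against FastAPI
location /pgadmin/ {
    auth_request /internal/verify-admin;   # 2xx from FastAPI lets the request through
    proxy_pass http://127.0.0.1:5050;
}

location = /internal/verify-admin {
    internal;
    proxy_pass http://127.0.0.1:8080;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

# Everything else is the Next.js frontend
location / {
    proxy_pass http://127.0.0.1:3001;
}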

Docker Compose manages all services (backend, frontend, PostgreSQL, Alertmanager, alert-logger, plus pgAdmin behind the admin profile) with automatic restarts via restart: unless-stopped.
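
An illustrative slice of the Compose file's structure; image names and port mappings here are placeholders for the sketch, not copied from deploy/docker/docker-compose.yml:

services:
  backend:
    image: freeinference/backend          # placeholder image name
    ports:
      - "127.0.0.1:8080:8080"             # matches the Nginx proxy target
    restart: unless-stopped
  frontend:
    image: freeinference/frontend         # placeholder image name
    ports:
      - "127.0.0.1:3001:3001"
    restart: unless-stopped
  postgres:
    image: postgres:16                    # tag is an assumption
    restart: unless-stopped
  pgadmin:
    image: dpage/pgadmin4
    profiles: ["admin"]                   # started only when the admin profile is enabled
    restart: unless-stopped
  # alertmanager and alert-logger follow the same restart policy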

Deployment

All services are defined in deploy/docker/docker-compose.yml. From the project root:

cp .env.example .env   # Configure secrets
make up                # Start all services
make ps                # Verify health

Nginx runs on the host (not containerized) for SSL termination. See Deployment for the full guide.
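
The make targets above are thin wrappers; a plausible sketch of what they expand to, assuming they simply shell out to Docker Compose with that file (the real Makefile may differ):

COMPOSE := docker compose -f deploy/docker/docker-compose.yml
.PHONY: up ps restart logs

up:        # start all services in the background
	$(COMPOSE) up -d

ps:        # show container status and health
	$(COMPOSE) ps

restart:   # restart everything, or one service: make restart s=backend
	$(COMPOSE) restart $(s)

logs:      # follow logs, or one service: make logs s=backend
	$(COMPOSE) logs -f $(s)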

Runtime Operations

  • Restart: make restart or make restart s=backend

  • Follow logs: make logs or make logs s=backend

  • Health check: curl https://freeinference.org/health

  • List registered models: curl https://freeinference.org/v1/models | jq (a filtered variant follows this list)
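
If the response follows the usual OpenAI-style list shape (an assumption here, with the model name in data[].id), the identifiers can be pulled out directly:

# Print only the model identifiers; assumes {"data": [{"id": ...}, ...]}
curl -s https://freeinference.org/v1/models | jq -r '.data[].id'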

Why Nginx Is Back

Nginx was briefly removed (see Legacy section below) when FreeInference was API-only and Cloudflare handled all edge concerns. It was re-introduced when we added:

  • Frontend: The Next.js web UI runs on port 3001 and needs to share the freeinference.org domain with the API. Path-based routing (/v1/* → backend, /* → frontend) is a natural fit for Nginx.

  • Body size limits: Qdrant vector upserts can be large. Nginx’s client_max_body_size gives a clear, configurable gate before traffic hits FastAPI.

  • WebSocket upgrade: Nginx handles the Upgrade / Connection headers cleanly for SSE and WebSocket-based streaming (a standard upgrade snippet follows this list).
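
For reference, the conventional Nginx pattern for that upgrade handling; a generic snippet, not lifted from the production config:

# In the http block: translate the client's Upgrade header into a Connection value
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

# In the server block: pass the upgrade through and avoid buffering streamed responses
location /v1/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade    $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_buffering  off;                  # lets SSE chunks flush immediately
    proxy_pass http://127.0.0.1:8080;
}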

Legacy Architectures

FastAPI direct (v3, abandoned)

We previously served OpenRouter-compatible traffic directly through FastAPI listening on port 80, without Nginx. This was simpler but could not support frontend co-hosting or fine-grained body size limits. Once the frontend was added, we moved back to Nginx.

Nginx (v2, abandoned)

We briefly fronted FastAPI (running on port 8080) with vanilla Nginx that listened on port 80 (redirecting to HTTPS) and terminated TLS on port 443 for https://freeinference.org. Once Cloudflare took over edge SSL duties, the extra hop mostly added deployment and observability complexity without material benefit, so the setup was removed.

Nginx + Lua via OpenResty (v1, abandoned)

We previously relied on OpenResty (Nginx + Lua) to provide a production routing tier across multiple LLM backends. The stack handled model mapping, load balancing, health checks, and error handling. We keep the installation notes for posterity.

Overview

┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Client    │─────▶│    OpenResty     │─────▶│    Backend 1    │
│ (API Call)  │      │     (Router)     │      │   (Qwen@8000)   │
└─────────────┘      │                  │      └─────────────────┘
                     │ - Model Mapping  │
                     │ - Load Balancing │      ┌─────────────────┐
                     │ - Health Checks  │─────▶│    Backend 2    │
                     │ - Error Handling │      │   (Test@8001)   │
                     └──────────────────┘      └─────────────────┘
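
For flavor, a heavily simplified access_by_lua_block along the lines of what the router did; the mapping table, variable name, and ports are invented for this sketch and do not reflect the real Lua code:

# Illustrative only: pick a backend port based on the requested model
location /v1/chat/completions {
    set $backend_port 8000;                      # default backend
    access_by_lua_block {
        local cjson = require("cjson.safe")
        ngx.req.read_body()
        local body = cjson.decode(ngx.req.get_body_data() or "") or {}
        -- invented mapping; the real setup kept state in the model_cache shared dict
        local backends = {
            ["/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8"] = 8000,
            ["/models/meta-llama_Llama-4-Scout-17B-16E"]        = 8001,
        }
        local port = backends[body.model]
        if not port then
            return ngx.exit(ngx.HTTP_BAD_REQUEST)  -- unknown model
        end
        ngx.var.backend_port = tostring(port)
    }
    proxy_pass http://127.0.0.1:$backend_port;
}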

Installation Notes

# Add repository
wget -O - https://openresty.org/package/pubkey.gpg | sudo apt-key add -
echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/openresty.list

# Install
sudo apt-get update
sudo apt-get install openresty
# Create directory
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-available
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-enabled

# Copy Config file
sudo cp <your config file> /usr/local/openresty/nginx/conf/sites-available/vllm

# Enable the site
sudo ln -s /usr/local/openresty/nginx/conf/sites-available/vllm \
           /usr/local/openresty/nginx/conf/sites-enabled/vllm

# Then, in the main nginx.conf (inside the http block), pull in the site configs:
http {
    # ... Others ...

    # Lua settings
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict model_cache 10m;

    # Include Site Configuration
    include /usr/local/openresty/nginx/conf/sites-enabled/*;
}
# test openresty config
sudo openresty -t

# Start
sudo systemctl start openresty

# Enable auto-start
sudo systemctl enable openresty

# reload openresty
sudo openresty -s reload

The model paths below are historical and may no longer match the registry; query /v1/models for the currently registered models.

# Chat with Qwen3-Coder
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Chat with llama4-scout
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/meta-llama_Llama-4-Scout-17B-16E", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'

Nginx (v0, abandoned)

sudo vim /etc/nginx/sites-available/vllm
sudo nginx -t
sudo systemctl reload nginx
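
The vllm site file edited above is not reproduced here; it was essentially a single reverse proxy, roughly along these lines (the backend port, server name, and certificate paths are assumptions for this reconstruction):

# Hypothetical reconstruction of /etc/nginx/sites-available/vllm
server {
    listen 443 ssl;
    server_name freeinference.org;

    ssl_certificate     /etc/letsencrypt/live/freeinference.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/freeinference.org/privkey.pem;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;   # assumed single vLLM backend
    }
}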

# to test the endpoint
curl https://freeinference.org/v1/models