# OpenRouter-Compatible API Gateway
A FastAPI-based gateway that serves OpenRouter-compatible traffic, fans out to local and remote LLM adapters, and exposes observability interfaces for operations. In production the application runs as a Docker container on port 8080 behind Nginx; for local development it can run on any port via uvicorn directly.
## Architecture
```text
hybridInference/
├── docs/                  # Deployment and integration guides
├── serving/
│   ├── servers/
│   │   ├── app.py         # FastAPI entry point (exposes /v1/*)
│   │   ├── bootstrap.py   # Service bootstrap: models, routing, DB
│   │   └── routers/       # API routers (health, models, completions, admin, ...)
│   ├── adapters/          # Provider adapters: openai_compat.py (vllm/sglang/
│   │                      #   ollama/chutes/featherless/deepseek/zhipu/minimax),
│   │                      #   openrouter.py, gemini.py, anthropic.py, claude.py,
│   │                      #   plus shared profiles.py
│   ├── storage/           # PostgreSQL or Cloudflare D1 stores (DB_BACKEND-driven)
│   ├── observability/     # Structured request logging
│   └── utils/             # Logging, configuration helpers
├── routing/               # Routing manager and execution strategies
├── config/
│   ├── models.yaml        # Canonical model definitions + adapters
│   └── routing.yaml       # Weighted routing configuration (optional)
└── deploy/docker/         # Dockerfiles and docker-compose.yml
```
## Key Components
- **FastAPI app** (`serving.servers.app:create_app`): Hosts OpenRouter-compatible endpoints plus admin and metrics routes.
- **Bootstrap** (`serving.servers.bootstrap`): Loads environment, registers models, applies routing weights, and wires database logging.
- **Adapters** (`serving.adapters.*`): Translate gateway requests into provider calls: `OpenAICompatAdapter` (vLLM, SGLang, Ollama, DeepSeek, Zhipu, MiniMax, Chutes, Featherless), `OpenRouterAdapter`, `GeminiAdapter`, `AnthropicAdapter`, `ClaudeAdapter` (Vertex). A sketch of this contract follows the list.
- **Routing** (`routing.*`): Supports fixed-ratio and future strategies for splitting traffic across adapters.
- **Observability** (`serving.observability`): Structured request logging.
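The adapter contract is not spelled out in this README, so the following is only a minimal sketch of what the component list implies: a shared async interface, with the OpenAI-compatible family forwarding the request body unchanged. The class names, method signatures, and the `httpx` dependency here are illustrative assumptions, not the gateway's actual code.

```python
# Illustrative sketch only: the real adapter classes live in serving/adapters/*
# and may differ in names, signatures, and error handling.
from abc import ABC, abstractmethod
from typing import Any

import httpx  # assumed HTTP client; the project may use another


class BaseAdapter(ABC):
    """Translate an OpenRouter-style request into a provider call."""

    @abstractmethod
    async def chat_completions(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class OpenAICompatSketch(BaseAdapter):
    """For providers that already speak the OpenAI wire format
    (vLLM, SGLang, Ollama, DeepSeek, ...) the body passes through as-is."""

    def __init__(self, base_url: str, api_key: str | None = None) -> None:
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}

    async def chat_completions(self, payload: dict[str, Any]) -> dict[str, Any]:
        async with httpx.AsyncClient(timeout=60.0) as client:
            resp = await client.post(
                f"{self.base_url}/chat/completions",
                json=payload,
                headers=self.headers,
            )
            resp.raise_for_status()
            return resp.json()
```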
## Features
- **OpenRouter API compatibility**: Implements `/v1/chat/completions`, `/v1/models`, and related schemas.
- **Hybrid routing**: Combine local vLLM workers with hosted APIs.
- **Resilient adapters**: Automatic retry/fallback when a provider returns errors (see the sketch after this list).
- **Usage accounting**: Prompt/completion token tracking and persisted request logs.
- **Streaming responses**: Server-Sent Events (SSE) for incremental output.
- **Observability hooks**: Structured request logs (PostgreSQL or Cloudflare D1, selected by `DB_BACKEND`).
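The retry/fallback feature can be pictured as a loop over adapters in priority order. This is a hypothetical illustration; the actual strategy (attempt counts, error classification, backoff, weighting) lives in `routing/`.

```python
# Hypothetical fallback loop; the real strategy in routing/ may classify
# errors, apply backoff, and weight adapters differently.
from typing import Any, Sequence


async def complete_with_fallback(
    adapters: Sequence[Any], payload: dict, attempts_per_adapter: int = 2
) -> dict:
    last_error: Exception | None = None
    for adapter in adapters:                   # providers in priority order
        for _ in range(attempts_per_adapter):  # retry transient failures
            try:
                return await adapter.chat_completions(payload)
            except Exception as exc:           # real code would narrow this
                last_error = exc
    raise RuntimeError("all providers failed") from last_error
```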
## Development Setup

### Prerequisites

- Python 3.10-3.13 (3.12 recommended)
- uv (recommended) or conda
### Create Environment

```bash
# Clone and bootstrap
git clone <repository-url>
cd hybridInference
uv venv -p 3.12
source .venv/bin/activate
uv sync
```
### Local Environment Variables

Create `.env` from the template:

```bash
cp .env.example .env
```

Populate it with provider credentials and runtime configuration:

```bash
LOCAL_BASE_URL=https://freeinference.org/v1
DEEPSEEK_API_KEY=your-deepseek-api-key
GEMINI_API_KEY=your-gemini-api-key
DB_BACKEND=postgres  # or "d1" for Cloudflare D1
```
### Run Locally

```bash
# Development server with reload on port 8080
uvicorn serving.servers.app:app --reload --host 0.0.0.0 --port 8080

# Alternate: respect the PORT env var
PORT=9000 uvicorn serving.servers.app:app --host 0.0.0.0 --port $PORT
```
When the app starts it will:
1. Load environment variables (dotenv).
2. Register models from `config/models.yaml`.
3. Apply routing overrides from `config/routing.yaml` if present.
4. Initialize the database logger and operational store. The backend is chosen by `DB_BACKEND` (`postgres` by default; set it to `d1` to use Cloudflare D1, in which case the Postgres init step is skipped; see `serving/servers/bootstrap.py`). A condensed sketch of this sequence follows the list.
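As a rough orientation, the startup sequence can be condensed as below. The function layout and the `bootstrap_sketch` name are illustrative; the authoritative logic is in `serving/servers/bootstrap.py`.

```python
# Condensed, illustrative startup sequence (not the actual bootstrap code).
import os

import yaml                     # PyYAML
from dotenv import load_dotenv  # python-dotenv


def bootstrap_sketch() -> tuple[dict, dict, str]:
    load_dotenv()                                  # 1. environment variables

    with open("config/models.yaml") as f:          # 2. model registry
        models = yaml.safe_load(f)

    routing: dict = {}
    if os.path.exists("config/routing.yaml"):      # 3. optional routing overrides
        with open("config/routing.yaml") as f:
            routing = yaml.safe_load(f)

    backend = os.getenv("DB_BACKEND", "postgres")  # 4. logging backend
    if backend == "postgres":
        ...  # Postgres pool/schema init happens here; skipped when backend == "d1"
    return models, routing, backend
```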
### Quick Checks

```bash
# Health
curl http://localhost:8080/health

# Models (OpenRouter schema)
curl http://localhost:8080/v1/models | jq

# Chat completion ("model" must match an entry in config/models.yaml)
env \
  http_proxy= \
  curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-chat",
      "messages": [{"role": "user", "content": "Ping"}],
      "max_tokens": 64
    }'
```
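Because the surface is OpenAI-compatible, the official `openai` Python client can also be pointed at the gateway. The API key and model name below are placeholders for whatever your deployment configures.

```python
# Any OpenAI-compatible client works against the gateway; values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-placeholder")

resp = client.chat.completions.create(
    model="deepseek-chat",  # must match an entry in config/models.yaml
    messages=[{"role": "user", "content": "Ping"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```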
## Production Deployment
All services run via Docker Compose. Nginx on the host terminates TLS; Cloudflare provides CDN and DDoS protection in front of Nginx.
```bash
make up   # Start all services
make ps   # Verify health
```
Runtime operations:

- Restart: `make restart` or `make restart s=backend`
- Logs: `make logs` or `make logs s=backend`
See the deployment guide in `docs/` for the full walkthrough.
Health:

```bash
curl https://freeinference.org/health
```
## API Surface

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/v1/models` | API key | Enumerate available models with OpenRouter metadata |
| POST | `/v1/chat/completions` | API key | OpenRouter/OpenAI-compatible chat completion |
| GET | `/health` | Public | Liveness and dependency checks |
| GET | | Public | Current routing weights for each model |
| GET | | Admin | Admin-authenticated alias of the routing-weights endpoint |
| GET | | Admin | Aggregated usage statistics; replaces the previous unauthenticated endpoint |
## Example Requests

```bash
# Streaming response
env \
  http_proxy= \
  curl -N -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-chat",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the architecture."}
      ],
      "stream": true,
      "temperature": 0.7,
      "max_tokens": 256
    }'
```
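The same request can be consumed from Python with the `openai` client's streaming mode, which iterates over the SSE chunks; credentials and model name are again placeholders.

```python
# Streaming consumption of the SSE endpoint via the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the architecture."},
    ],
    stream=True,
    temperature=0.7,
    max_tokens=256,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```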
## Logging and Metrics

The database backend is selected by `DB_BACKEND` in `.env`, read by `serving/servers/bootstrap.py`:
- **PostgreSQL (default)**: `DB_BACKEND=postgres` (or unset). Logs and operational state go to the `postgres` container defined in `deploy/docker/docker-compose.yml`. Connection parameters come from `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD`.
- **Cloudflare D1**: `DB_BACKEND=d1`. Requires `D1_ACCOUNT_ID`, `D1_DATABASE_ID`, and `D1_API_TOKEN`. The Postgres init step is skipped entirely.
- **Dual-write (migration mode)**: When `D1_DUAL_WRITE=1`, writes go to both backends to support migration; reads still come from D1 (sketched below).
- **Metrics**: Prometheus instrumentation has been removed; structured logs in the configured database are the supported observability surface today.
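Dual-write mode can be pictured as a thin wrapper over the two stores. The class and method names here are assumptions for illustration; the real store implementations live in `serving/storage/`.

```python
# Hypothetical dual-write wrapper; real store classes live in serving/storage/.
import os


class DualWriteLogStore:
    def __init__(self, d1_store, postgres_store) -> None:
        self.d1 = d1_store
        self.pg = postgres_store
        self.dual = os.getenv("D1_DUAL_WRITE") == "1"

    def write(self, record: dict) -> None:
        self.d1.write(record)      # D1 is the primary backend
        if self.dual:
            self.pg.write(record)  # mirror to Postgres during migration

    def read(self, query: dict):
        return self.d1.read(query)  # reads always come from D1
```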
Inspect logs (PostgreSQL backend):

```bash
docker exec -it hybridinference-postgres psql -U $DB_USER -d $DB_NAME \
  -c "SELECT model_id, COUNT(*) FROM api_logs GROUP BY model_id;"
```
## Testing

```bash
# Fast unit/integration tests
pytest -m "not external" -q

# Focused server tests
pytest tests/servers/test_bootstrap.py -q
```
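The `external` marker separates tests that call real providers from the fast suite. A test would opt in roughly like this, assuming the marker is registered in the project's pytest configuration:

```python
# Hypothetical example of the "external" marker convention used above.
import pytest


@pytest.mark.external  # excluded by: pytest -m "not external"
def test_live_provider_roundtrip():
    ...  # would hit a real provider API, so it stays out of the fast suite
```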
## Troubleshooting

- **Port already in use**: `sudo lsof -ti :80 | xargs sudo kill -9`
- **Missing models**: Verify `config/models.yaml` contains the expected entries and that `LOCAL_BASE_URL` is reachable.
- **No logs written**: Confirm `DB_BACKEND` is set correctly and that the chosen backend is reachable (Postgres healthy or D1 credentials valid).