# OpenRouter-Compatible API Gateway
A FastAPI-based gateway that serves OpenRouter-compatible traffic, fans out to local and remote LLM adapters, and exposes observability interfaces for operations. In production the application runs as a Docker container on port 8080 behind Nginx; for local development it can run on any port via uvicorn directly.
## Architecture
```text
hybridInference/
├── docs/                  # Deployment and integration guides
├── serving/
│   ├── servers/
│   │   ├── app.py         # FastAPI entry point (exposes /v1/*)
│   │   ├── bootstrap.py   # Service bootstrap: models, routing, DB
│   │   └── routers/       # API routers (health, models, completions, admin, ...)
│   ├── adapters/          # Provider adapters: openai_compat.py (vllm/sglang/
│   │                      #   ollama/chutes/featherless/deepseek/zhipu/minimax),
│   │                      #   openrouter.py, gemini.py, anthropic.py, claude.py,
│   │                      #   plus shared profiles.py
│   ├── storage/           # PostgreSQL or Cloudflare D1 stores (DB_BACKEND-driven)
│   ├── observability/     # Structured request logging
│   └── utils/             # Logging, configuration helpers
├── routing/               # Routing manager and execution strategies
├── config/
│   ├── models.yaml        # Canonical model definitions + adapters
│   └── routing.yaml       # Weighted routing configuration (optional)
└── deploy/docker/         # Dockerfiles and docker-compose.yml
```
## Key Components
- **FastAPI app** (`serving.servers.app:create_app`): Hosts OpenRouter-compatible endpoints plus admin and metrics routes.
- **Bootstrap** (`serving.servers.bootstrap`): Loads environment, registers models, applies routing weights, and wires database logging.
- **Adapters** (`serving.adapters.*`): Translate gateway requests into provider calls: `OpenAICompatAdapter` (vLLM, SGLang, Ollama, DeepSeek, Zhipu, MiniMax, Chutes, Featherless), `OpenRouterAdapter`, `GeminiAdapter`, `AnthropicAdapter`, `ClaudeAdapter` (Vertex). A sketch of this contract follows the list.
- **Routing** (`routing.*`): Supports fixed-ratio and future strategies for splitting traffic across adapters.
- **Observability** (`serving.observability`): Structured request logging.
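The adapter contract is not spelled out in this README, so the following is only a minimal sketch of what the component list implies: a shared async interface, with the OpenAI-compatible family forwarding the request body unchanged. The class names, method signatures, and the `httpx` dependency here are illustrative assumptions, not the gateway's actual code.

```python
# Illustrative sketch only: the real adapter classes live in serving/adapters/*
# and may differ in names, signatures, and error handling.
from abc import ABC, abstractmethod
from typing import Any

import httpx  # assumed HTTP client; the project may use another


class BaseAdapter(ABC):
    """Translate an OpenRouter-style request into a provider call."""

    @abstractmethod
    async def chat_completions(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class OpenAICompatSketch(BaseAdapter):
    """For providers that already speak the OpenAI wire format
    (vLLM, SGLang, Ollama, DeepSeek, ...) the body passes through as-is."""

    def __init__(self, base_url: str, api_key: str | None = None) -> None:
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}

    async def chat_completions(self, payload: dict[str, Any]) -> dict[str, Any]:
        async with httpx.AsyncClient(timeout=60.0) as client:
            resp = await client.post(
                f"{self.base_url}/chat/completions",
                json=payload,
                headers=self.headers,
            )
            resp.raise_for_status()
            return resp.json()
```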
## Features
- **OpenRouter API compatibility**: Implements `/v1/chat/completions`, `/v1/models`, and related schemas.
- **Hybrid routing**: Combine local vLLM workers with hosted APIs.
- **Resilient adapters**: Automatic retry/fallback when a provider returns errors (see the sketch after this list).
- **Usage accounting**: Prompt/completion token tracking and persisted request logs.
- **Streaming responses**: Server-Sent Events (SSE) for incremental output.
- **Observability hooks**: Structured request logs (PostgreSQL or Cloudflare D1, selected by `DB_BACKEND`).
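The retry/fallback feature can be pictured as a loop over adapters in priority order. This is a hypothetical illustration; the actual strategy (attempt counts, error classification, backoff, weighting) lives in `routing/`.

```python
# Hypothetical fallback loop; the real strategy in routing/ may classify
# errors, apply backoff, and weight adapters differently.
from typing import Any, Sequence


async def complete_with_fallback(
    adapters: Sequence[Any], payload: dict, attempts_per_adapter: int = 2
) -> dict:
    last_error: Exception | None = None
    for adapter in adapters:                   # providers in priority order
        for _ in range(attempts_per_adapter):  # retry transient failures
            try:
                return await adapter.chat_completions(payload)
            except Exception as exc:           # real code would narrow this
                last_error = exc
    raise RuntimeError("all providers failed") from last_error
```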
## Development Setup

### Prerequisites

- Python 3.10-3.13 (3.12 recommended)
- uv (recommended) or conda
### Create Environment

```bash
# Clone and bootstrap
git clone <repository-url>
cd hybridInference
uv venv -p 3.12
source .venv/bin/activate
uv sync
```
### Local Environment Variables

Create `.env` from the template:

```bash
cp .env.example .env
```

Populate it with provider credentials and runtime configuration:

```bash
LOCAL_BASE_URL=https://freeinference.org/v1
DEEPSEEK_API_KEY=your-deepseek-api-key
GEMINI_API_KEY=your-gemini-api-key
DB_BACKEND=postgres  # or "d1" for Cloudflare D1
```
### Run Locally

```bash
# Development server with reload on port 8080
uvicorn serving.servers.app:app --reload --host 0.0.0.0 --port 8080

# Alternate: respect the PORT env var
PORT=9000 uvicorn serving.servers.app:app --host 0.0.0.0 --port $PORT
```
When the app starts it will:
1. Load environment variables (dotenv).
2. Register models from `config/models.yaml`.
3. Apply routing overrides from `config/routing.yaml` if present.
4. Initialize the database logger and operational store. The backend is chosen by `DB_BACKEND` (`postgres` by default; set it to `d1` to use Cloudflare D1, in which case the Postgres init step is skipped; see `serving/servers/bootstrap.py`). A condensed sketch of this sequence follows the list.
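As a rough orientation, the startup sequence can be condensed as below. The function layout and the `bootstrap_sketch` name are illustrative; the authoritative logic is in `serving/servers/bootstrap.py`.

```python
# Condensed, illustrative startup sequence (not the actual bootstrap code).
import os

import yaml                     # PyYAML
from dotenv import load_dotenv  # python-dotenv


def bootstrap_sketch() -> tuple[dict, dict, str]:
    load_dotenv()                                  # 1. environment variables

    with open("config/models.yaml") as f:          # 2. model registry
        models = yaml.safe_load(f)

    routing: dict = {}
    if os.path.exists("config/routing.yaml"):      # 3. optional routing overrides
        with open("config/routing.yaml") as f:
            routing = yaml.safe_load(f)

    backend = os.getenv("DB_BACKEND", "postgres")  # 4. logging backend
    if backend == "postgres":
        ...  # Postgres pool/schema init happens here; skipped when backend == "d1"
    return models, routing, backend
```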
### Quick Checks

```bash
# Health
curl http://localhost:8080/health

# Models (OpenRouter schema)
curl http://localhost:8080/v1/models | jq

# Chat completion ("model" must match an entry in config/models.yaml)
env \
  http_proxy= \
  curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-chat",
      "messages": [{"role": "user", "content": "Ping"}],
      "max_tokens": 64
    }'
```
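Because the surface is OpenAI-compatible, the official `openai` Python client can also be pointed at the gateway. The API key and model name below are placeholders for whatever your deployment configures.

```python
# Any OpenAI-compatible client works against the gateway; values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-placeholder")

resp = client.chat.completions.create(
    model="deepseek-chat",  # must match an entry in config/models.yaml
    messages=[{"role": "user", "content": "Ping"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```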
## Production Deployment
All services run via Docker Compose. Nginx on the host terminates TLS; Cloudflare provides CDN and DDoS protection in front of Nginx.
```bash
make up   # Start all services
make ps   # Verify health
```
Runtime operations:

- Restart: `make restart` or `make restart s=backend`
- Logs: `make logs` or `make logs s=backend`
See the deployment guide in `docs/` for the full walkthrough.
Health:

```bash
curl https://freeinference.org/health
```
## API Surface

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/v1/models` | API key | Enumerate available models with OpenRouter metadata |
| POST | `/v1/chat/completions` | API key | OpenRouter/OpenAI-compatible chat completion |
| GET | `/health` | Public | Liveness and dependency checks |
| GET | | Public | Current routing weights for each model |
| GET | | Admin | Admin-authenticated alias of the routing-weights endpoint |
| GET | | Admin | Aggregated usage statistics; replaces the previous unauthenticated endpoint |
## Example Requests

```bash
# Streaming response
env \
  http_proxy= \
  curl -N -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-chat",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the architecture."}
      ],
      "stream": true,
      "temperature": 0.7,
      "max_tokens": 256
    }'
```
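The same request can be consumed from Python with the `openai` client's streaming mode, which iterates over the SSE chunks; credentials and model name are again placeholders.

```python
# Streaming consumption of the SSE endpoint via the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-placeholder")

stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the architecture."},
    ],
    stream=True,
    temperature=0.7,
    max_tokens=256,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```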
## Logging and Metrics

The database backend is selected by `DB_BACKEND` in `.env`, read by `serving/servers/bootstrap.py`:
- **PostgreSQL (default)**: `DB_BACKEND=postgres` (or unset). Logs and operational state go to the `postgres` container defined in `deploy/docker/docker-compose.yml`. Connection parameters come from `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD`.
- **Cloudflare D1**: `DB_BACKEND=d1`. Requires `D1_ACCOUNT_ID`, `D1_DATABASE_ID`, and `D1_API_TOKEN`. The Postgres init step is skipped entirely.
- **Dual-write (migration mode)**: When `D1_DUAL_WRITE=1`, writes go to both backends to support migration; reads still come from D1 (sketched below).
- **Metrics**: Prometheus instrumentation has been removed; structured logs in the configured database are the supported observability surface today.
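Dual-write mode can be pictured as a thin wrapper over the two stores. The class and method names here are assumptions for illustration; the real store implementations live in `serving/storage/`.

```python
# Hypothetical dual-write wrapper; real store classes live in serving/storage/.
import os


class DualWriteLogStore:
    def __init__(self, d1_store, postgres_store) -> None:
        self.d1 = d1_store
        self.pg = postgres_store
        self.dual = os.getenv("D1_DUAL_WRITE") == "1"

    def write(self, record: dict) -> None:
        self.d1.write(record)      # D1 is the primary backend
        if self.dual:
            self.pg.write(record)  # mirror to Postgres during migration

    def read(self, query: dict):
        return self.d1.read(query)  # reads always come from D1
```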
Inspect logs (PostgreSQL backend):

```bash
docker exec -it hybridinference-postgres psql -U $DB_USER -d $DB_NAME \
  -c "SELECT model_id, COUNT(*) FROM api_logs GROUP BY model_id;"
```
## Testing

```bash
# Fast unit/integration tests
pytest -m "not external" -q

# Focused server tests
pytest tests/servers/test_bootstrap.py -q
```
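The `external` marker separates tests that call real providers from the fast suite. A test would opt in roughly like this, assuming the marker is registered in the project's pytest configuration:

```python
# Hypothetical example of the "external" marker convention used above.
import pytest


@pytest.mark.external  # excluded by: pytest -m "not external"
def test_live_provider_roundtrip():
    ...  # would hit a real provider API, so it stays out of the fast suite
```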
## Troubleshooting

- **Port already in use**: `sudo lsof -ti :80 | xargs sudo kill -9`
- **Missing models**: Verify `config/models.yaml` contains the expected entries and that `LOCAL_BASE_URL` is reachable.
- **No logs written**: Confirm `DB_BACKEND` is set correctly and that the chosen backend is reachable (Postgres healthy or D1 credentials valid).