# OpenRouter-Compatible API Gateway

A FastAPI-based gateway that serves OpenRouter-compatible traffic, fans out to local and remote LLM adapters, and exposes observability interfaces for operations. In production the application runs as a Docker container on port 8080 behind Nginx; for local development it can run on any port via `uvicorn` directly.

## Architecture

```
hybridInference/
├── docs/                 # Deployment and integration guides
├── serving/
│   ├── servers/
│   │   ├── app.py        # FastAPI entry point (exposes /v1/*)
│   │   ├── bootstrap.py  # Service bootstrap: models, routing, DB
│   │   └── routers/      # API routers (health, models, completions, admin, ...)
│   ├── adapters/         # Provider adapters: openai_compat.py (vllm/sglang/
│   │                     #   ollama/chutes/featherless/deepseek/zhipu/minimax),
│   │                     #   openrouter.py, gemini.py, anthropic.py, claude.py,
│   │                     #   plus shared profiles.py
│   ├── storage/          # PostgreSQL or Cloudflare D1 stores (DB_BACKEND-driven)
│   ├── observability/    # Structured request logging
│   └── utils/            # Logging, configuration helpers
├── routing/              # Routing manager and execution strategies
├── config/
│   ├── models.yaml       # Canonical model definitions + adapters
│   └── routing.yaml      # Weighted routing configuration (optional)
└── deploy/docker/        # Dockerfiles and docker-compose.yml
```

### Key Components

- **FastAPI app (`serving.servers.app:create_app`)**: Hosts OpenRouter-compatible endpoints plus admin and metrics routes.
- **Bootstrap (`serving.servers.bootstrap`)**: Loads environment, registers models, applies routing weights, and wires database logging.
- **Adapters (`serving.adapters.*`)**: Translate requests to providers: `OpenAICompatAdapter` (vLLM, SGLang, Ollama, DeepSeek, Zhipu, MiniMax, Chutes, Featherless), `OpenRouterAdapter`, `GeminiAdapter`, `AnthropicAdapter`, `ClaudeAdapter` (Vertex).
- **Routing (`routing.*`)**: Supports fixed-ratio and future strategies for splitting traffic across adapters.
- **Observability (`serving.observability`)**: Structured request logging.

## Features

- **OpenRouter API compatibility**: Implements `/v1/chat/completions`, `/v1/models`, and related schemas.
- **Hybrid routing**: Combine local vLLM workers with hosted APIs.
- **Resilient adapters**: Automatic retry/fallback when a provider returns errors.
- **Usage accounting**: Prompt/completion token tracking and persisted request logs.
- **Streaming responses**: Server-Sent Events (SSE) for incremental output.
- **Observability hooks**: Structured request logs (PostgreSQL or Cloudflare D1, selected by `DB_BACKEND`).

## Development Setup

### Prerequisites

- Python 3.10-3.13 (3.12 recommended)
- [uv](https://github.com/astral-sh/uv) (recommended) or conda

### Create Environment

```bash
# Clone and bootstrap
git clone <repo-url>
cd hybridInference
uv venv -p 3.12
source .venv/bin/activate
uv sync
```

### Local Environment Variables

Create `.env` from the template:

```bash
cp .env.example .env
```

Populate it with provider credentials and runtime configuration:

```bash
LOCAL_BASE_URL=https://freeinference.org/v1
DEEPSEEK_API_KEY=your-deepseek-api-key
GEMINI_API_KEY=your-gemini-api-key
DB_BACKEND=postgres  # or "d1" for Cloudflare D1
```

### Run Locally

```bash
# Development server with reload on port 8080
uvicorn serving.servers.app:app --reload --host 0.0.0.0 --port 8080

# Alternate: respect the PORT env var (export it so the shell can expand it)
export PORT=9000
uvicorn serving.servers.app:app --host 0.0.0.0 --port "$PORT"
```

When the app starts it will:

1. Load environment variables (dotenv).
2. Register models from `config/models.yaml`.
3. Apply routing overrides from `config/routing.yaml` if present.
4. Initialize the database logger and operational store. The backend is chosen by `DB_BACKEND` (`postgres` by default; set to `d1` to use Cloudflare D1, in which case the Postgres init step is skipped; see `serving/servers/bootstrap.py`).
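For orientation, the sketch below mirrors this startup sequence in plain Python. It is illustrative only: the function name and return shape are invented for this example, and the real implementation in `serving/servers/bootstrap.py` wires these steps into the FastAPI app rather than returning a dict.

```python
# Simplified sketch of the bootstrap sequence; not the actual implementation.
import os
from pathlib import Path

import yaml
from dotenv import load_dotenv


def bootstrap_sketch(config_dir: Path = Path("config")) -> dict:
    # 1. Load environment variables from .env
    load_dotenv()

    # 2. Register models from config/models.yaml
    models = yaml.safe_load((config_dir / "models.yaml").read_text())

    # 3. Apply routing overrides only when config/routing.yaml exists
    routing = None
    routing_path = config_dir / "routing.yaml"
    if routing_path.exists():
        routing = yaml.safe_load(routing_path.read_text())

    # 4. Pick the logging backend: "postgres" unless DB_BACKEND=d1
    db_backend = os.getenv("DB_BACKEND", "postgres")

    return {"models": models, "routing": routing, "db_backend": db_backend}
```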
### Quick Checks

```bash
# Health
curl http://localhost:8080/health

# Models (OpenRouter schema)
curl http://localhost:8080/v1/models | jq

# Chat completion
env \
  http_proxy= \
  curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Ping"}],
    "max_tokens": 64
  }'
```

## Production Deployment

All services run via Docker Compose. Nginx on the host terminates TLS; Cloudflare provides CDN and DDoS protection in front of Nginx.

```bash
make up  # Start all services
make ps  # Verify health
```

Runtime operations:

- Restart: `make restart` or `make restart s=backend`
- Logs: `make logs` or `make logs s=backend`
- Health: `curl https://freeinference.org/health`

See [Deployment](deployment.md) for the full guide.

## API Surface

| Method | Path | Auth | Description |
| ------ | ---- | ---- | ----------- |
| GET | `/v1/models` | API key | Enumerate available models with OpenRouter metadata |
| POST | `/v1/chat/completions` | API key | OpenRouter/OpenAI-compatible chat completion |
| GET | `/health` | Public | Liveness and dependency checks |
| GET | `/routing` | Public | Current routing weights for each model |
| GET | `/admin/routing` | Admin | Admin-authenticated alias of `/routing` |
| GET | `/admin/stats` | Admin | Aggregated usage statistics (the previous unauthenticated `/stats` alias has been removed; use this endpoint instead) |

### Example Requests

```bash
# Streaming response
env \
  http_proxy= \
  curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Describe the architecture."}
    ],
    "stream": true,
    "temperature": 0.7,
    "max_tokens": 256
  }'
```
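Because the gateway speaks the OpenAI/OpenRouter wire format, any OpenAI-compatible SDK can talk to it by overriding the base URL. Below is a minimal Python sketch using the `openai` package; the model name and API key are placeholders that depend on your `config/models.yaml` entries and auth setup.

```python
# Minimal streaming client sketch; assumes `pip install openai` and a gateway
# running on localhost:8080. Model name and API key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the architecture."},
    ],
    stream=True,
    temperature=0.7,
    max_tokens=256,
)

# Print deltas as SSE chunks arrive; some chunks carry no content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Dropping `stream=True` returns a single response object whose `usage` field carries the same prompt/completion token counts that the gateway tracks in its request logs.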
## Logging and Metrics

The database backend is selected by `DB_BACKEND` in `.env`, read by `serving/servers/bootstrap.py`:

- **PostgreSQL (default)**: `DB_BACKEND=postgres` (or unset). Logs and operational state go to the `postgres` container defined in `deploy/docker/docker-compose.yml`. Connection parameters come from `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`.
- **Cloudflare D1**: `DB_BACKEND=d1`. Requires `D1_ACCOUNT_ID`, `D1_DATABASE_ID`, and `D1_API_TOKEN`. The Postgres init step is skipped entirely.
- **Dual-write (migration mode)**: When `D1_DUAL_WRITE=1`, writes go to both backends to support migration; reads still come from D1.
- **Metrics**: Prometheus instrumentation has been removed; structured logs in the configured database are the supported observability surface today.

Inspect logs (PostgreSQL backend):

```bash
docker exec -it hybridinference-postgres psql -U $DB_USER -d $DB_NAME \
  -c "SELECT model_id, COUNT(*) FROM api_logs GROUP BY model_id;"
```

## Testing

```bash
# Fast unit/integration tests
pytest -m "not external" -q

# Focused server tests
pytest tests/servers/test_bootstrap.py -q
```

## Troubleshooting

- **Port already in use**: `sudo lsof -ti :80 | xargs sudo kill -9`
- **Missing models**: Verify `config/models.yaml` contains the expected entries and that `LOCAL_BASE_URL` is reachable.
- **No logs written**: Confirm `DB_BACKEND` is set correctly and the chosen backend is reachable (Postgres healthy or D1 credentials valid).

## Related Docs

- [FreeInference Deployment](freeinference.md): Current production deployment architecture (Cloudflare + Nginx + FastAPI) and historical deployment iterations.
- [Routing](routing.md): Detailed routing manager configuration and strategy extension guide.
- [Adding Models](adding-models.md): How to add new models (YAML) and integrate new providers (adapter) in one place.