# Architecture Overview

HybridInference is designed as a modular, high-performance inference gateway.

## System Architecture

```
  ┌─────────────┐
  │   Clients   │
  └──────┬──────┘
         │
         ▼
┌─────────────────┐
│  FastAPI Gateway│
│   (serving/)    │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌─────────┐
│Routing │ │Adapters │
│Manager │ │         │
└────┬───┘ └────┬────┘
     │          │
     ▼          ▼
┌────────────────────────────┐
│       LLM Providers        │
│ ┌────────────────────────┐ │
│ │ vLLM / SGLang (local)  │ │
│ │ Ollama (local/remote)  │ │
│ │ OpenAI-compat APIs     │ │
│ │   (DeepSeek, Zhipu,    │ │
│ │   Chutes, Featherless) │ │
│ │ OpenRouter             │ │
│ │ Gemini API             │ │
│ │ Anthropic API          │ │
│ └────────────────────────┘ │
└────────────────────────────┘
```

## Network Layer

External traffic passes through three layers before reaching the application logic:

```
Client ──▶ Cloudflare (CDN + DDoS) ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                                   └──▶ Frontend (:3001)
```

- **Cloudflare**: Edge CDN, DDoS protection, SSL termination (Full strict mode).
- **Nginx**: Origin TLS, path-based routing, body size limits, WebSocket upgrade.
- **FastAPI**: API authentication, model routing, observability.

The network layer handles external connectivity and request delivery. The sections below describe the internal inference pipeline that runs inside FastAPI.
## Core Components

### Serving Layer (`serving/`)

The serving layer provides the FastAPI-based gateway:

- **Gateway**: HTTP API endpoints for inference requests
- **Adapters**: Provider-specific API adapters
- **Observability**: Logging, metrics, and tracing
- **Storage**: PostgreSQL integration for request/response logging

### Routing Layer (`routing/`)

Intelligent routing and load balancing:

- **Manager**: Routes requests to optimal providers
- **Strategies**: Different routing algorithms (round-robin, cost-based, latency-based)
- **Health Checks**: Monitor provider availability and performance

### Configuration (`config/`)

Centralized configuration management:

- Model configurations
- Provider settings
- Routing policies
- Feature flags

### Deploy (`deploy/`)

Deployment and observability:

- Docker Compose service definitions and Dockerfiles
- Alertmanager and alert logger

## Key Design Principles

1. **Modularity**: Clear separation between serving, routing, and provider layers
2. **Extensibility**: Easy to add new providers and routing strategies
3. **Observability**: Comprehensive logging and metrics at every layer
4. **Performance**: Optimized for low-latency, high-throughput inference
5. **Reliability**: Health checks, retries, and fallback mechanisms

## Data Flow

1. Client sends inference request to Gateway
2. Gateway validates and preprocesses request
3. Routing Manager selects optimal provider
4. Adapter translates request to provider-specific format
5. Provider processes inference
6. Response is logged to PostgreSQL
7. Response is returned to client
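
The data flow above can be illustrated with a minimal, dependency-free Python sketch. It is not the project's actual code: `EchoAdapter`, `RoundRobinRouter`, and `Gateway` are hypothetical names, the adapter simply echoes the prompt instead of calling a provider, and an in-memory list stands in for the PostgreSQL request log. It shows one possible way a round-robin strategy with fallback to the next provider could fit the seven steps.

```python
from dataclasses import dataclass, field


@dataclass
class EchoAdapter:
    """Hypothetical stand-in for a provider adapter (steps 4-5).

    A real adapter would translate the request and call the provider API;
    this one only echoes the prompt so the sketch runs without a network.
    """
    name: str
    healthy: bool = True

    def complete(self, prompt: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"[{self.name}] {prompt}"


@dataclass
class RoundRobinRouter:
    """Hypothetical round-robin strategy (step 3)."""
    adapters: list[EchoAdapter]
    _cursor: int = 0

    def candidates(self) -> list[EchoAdapter]:
        """Return all adapters, rotated so each call starts one position later."""
        n = len(self.adapters)
        if n == 0:
            return []
        order = [self.adapters[(self._cursor + i) % n] for i in range(n)]
        self._cursor = (self._cursor + 1) % n
        return order


@dataclass
class Gateway:
    """Hypothetical gateway tying the steps together."""
    router: RoundRobinRouter
    # Assumption: an in-memory list replaces the PostgreSQL log.
    request_log: list[dict] = field(default_factory=list)

    def handle(self, prompt: str) -> str:
        # Step 2: validate the request.
        if not prompt.strip():
            raise ValueError("prompt must not be empty")

        last_error: Exception | None = None
        # Step 3: try providers in the order chosen by the router.
        for adapter in self.router.candidates():
            try:
                response = adapter.complete(prompt)  # Steps 4-5.
            except ConnectionError as exc:
                last_error = exc  # Fall back to the next provider.
                continue
            # Step 6: log the exchange.
            self.request_log.append(
                {"provider": adapter.name, "prompt": prompt, "response": response}
            )
            return response  # Step 7.
        raise RuntimeError("no provider available") from last_error


# Usage: the unhealthy provider is skipped and the request falls back.
gateway = Gateway(
    RoundRobinRouter([EchoAdapter("vllm"), EchoAdapter("ollama", healthy=False)])
)
print(gateway.handle("hello"))
print(gateway.handle("hello"))
```

In this sketch the fallback is driven by exceptions raised at call time; a gateway with periodic health checks, as described under the Routing Layer, could instead filter unhealthy providers before selection.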