Architecture Overview
HybridInference is designed as a modular, high-performance inference gateway.
System Architecture
          ┌─────────────┐
          │   Clients   │
          └──────┬──────┘
                 │
                 ▼
        ┌─────────────────┐
        │ FastAPI Gateway │
        │   (serving/)    │
        └────────┬────────┘
                 │
            ┌────┴────┐
            │         │
            ▼         ▼
       ┌────────┐ ┌─────────┐
       │Routing │ │Adapters │
       │Manager │ │         │
       └────┬───┘ └────┬────┘
            │          │
            ▼          ▼
  ┌────────────────────────────┐
  │       LLM Providers        │
  │ ┌────────────────────────┐ │
  │ │ vLLM / SGLang (local)  │ │
  │ │ Ollama (local/remote)  │ │
  │ │ OpenAI-compat APIs     │ │
  │ │  (DeepSeek, Zhipu,     │ │
  │ │  Chutes, Featherless)  │ │
  │ │ OpenRouter             │ │
  │ │ Gemini API             │ │
  │ │ Anthropic API          │ │
  │ └────────────────────────┘ │
  └────────────────────────────┘
Network Layer
External traffic passes through three layers before reaching the application logic:
Client ──▶ Cloudflare (CDN + DDoS) ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                             └─────▶ Frontend (:3001)
Cloudflare: Edge CDN, DDoS protection, SSL termination (Full (strict) mode).
Nginx: Origin TLS, path-based routing, body size limits, WebSocket upgrade.
FastAPI: API authentication, model routing, observability.
The network layer handles external connectivity and request delivery. The sections below describe the internal inference pipeline that runs inside FastAPI.
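As a rough illustration of the Nginx layer, a minimal server block might look like the following. This is a hypothetical sketch: the server name, location paths, certificate paths, and body-size value are assumptions, not the project's actual configuration; only the ports (:443, :8080, :3001) come from the diagram above.

```nginx
server {
    listen 443 ssl;
    server_name api.example.com;                      # assumed hostname

    ssl_certificate     /etc/nginx/certs/origin.pem;  # origin TLS (paths assumed)
    ssl_certificate_key /etc/nginx/certs/origin.key;

    client_max_body_size 10m;                         # body size limit (assumed value)

    # Path-based routing: API traffic to FastAPI, everything else to the frontend
    location /api/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;       # WebSocket upgrade
        proxy_set_header Connection "upgrade";
    }

    location / {
        proxy_pass http://127.0.0.1:3001;
    }
}
```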
Core Components
Serving Layer (serving/)
The serving layer provides the FastAPI-based gateway:
Gateway: HTTP API endpoints for inference requests
Adapters: Provider-specific API adapters
Observability: Logging, metrics, and tracing
Storage: PostgreSQL integration for request/response logging
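To make the adapter role concrete, here is a minimal sketch of translating a normalized gateway request into an OpenAI-compatible payload and back. Class and field names (`InferenceRequest`, `OpenAICompatAdapter`) are illustrative, not the project's actual API; only the OpenAI-style `choices[0].message.content` shape is standard.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceRequest:
    """Provider-agnostic request as the Gateway might normalize it (hypothetical)."""
    model: str
    prompt: str
    max_tokens: int = 256
    metadata: dict = field(default_factory=dict)


class OpenAICompatAdapter:
    """Translates a normalized request to/from an OpenAI-compatible API shape."""

    def to_provider_payload(self, req: InferenceRequest) -> dict:
        # OpenAI-compatible chat endpoints take a messages array
        return {
            "model": req.model,
            "messages": [{"role": "user", "content": req.prompt}],
            "max_tokens": req.max_tokens,
        }

    def from_provider_response(self, raw: dict) -> str:
        # OpenAI-compatible APIs nest the text under choices[0].message.content
        return raw["choices"][0]["message"]["content"]
```

Each provider family (Gemini, Anthropic, Ollama, …) would get its own adapter with the same two methods, keeping the gateway unaware of provider wire formats.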
Routing Layer (routing/)
Intelligent routing and load balancing:
Manager: Routes requests to optimal providers
Strategies: Different routing algorithms (round-robin, cost-based, latency-based)
Health Checks: Monitor provider availability and performance
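The strategy pattern above can be sketched as interchangeable objects with a single `pick` method; the Manager delegates provider choice to whichever strategy is configured. Names here are assumptions, not taken from `routing/`:

```python
import itertools


class RoundRobinStrategy:
    """Cycle through providers in a fixed order."""

    def __init__(self, providers):
        self._cycle = itertools.cycle(providers)

    def pick(self, stats):
        return next(self._cycle)


class LatencyStrategy:
    """Pick the provider with the lowest recent latency."""

    def pick(self, stats):
        # stats maps provider name -> recent p50 latency in ms
        return min(stats, key=stats.get)


def route(strategy, stats):
    """Manager-side hook: delegate provider choice to the strategy."""
    return strategy.pick(stats)
```

A cost-based strategy would follow the same shape, with `stats` carrying per-token prices instead of latencies.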
Configuration (config/)
Centralized configuration management:
Model configurations
Provider settings
Routing policies
Feature flags
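The four configuration concerns above could be modeled as one typed object; this is a minimal sketch assuming simple dataclasses, and the real `config/` layout may differ:

```python
from dataclasses import dataclass, field


@dataclass
class ProviderSettings:
    """Connection settings for one provider (illustrative fields)."""
    base_url: str
    api_key_env: str          # name of the env var holding the key, never the key itself
    timeout_s: float = 30.0


@dataclass
class AppConfig:
    """Centralized configuration: models, providers, routing, flags."""
    models: dict = field(default_factory=dict)      # model name -> provider name
    providers: dict = field(default_factory=dict)   # provider name -> ProviderSettings
    routing_policy: str = "latency"                 # e.g. round-robin | cost | latency
    feature_flags: dict = field(default_factory=dict)
```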
Deploy (deploy/)
Deployment and observability:
Docker Compose service definitions and Dockerfiles
Alertmanager and alert logger
Key Design Principles
Modularity: Clear separation between serving, routing, and provider layers
Extensibility: Easy to add new providers and routing strategies
Observability: Comprehensive logging and metrics at every layer
Performance: Optimized for low-latency, high-throughput inference
Reliability: Health checks, retries, and fallback mechanisms
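The reliability principle (retries plus fallback) reduces to a small control loop: try each provider in order, retrying transient failures before moving on. The provider callables and the use of `RuntimeError` as the transient-error type are stand-ins for illustration:

```python
def call_with_fallback(providers, request, retries=2):
    """Try each provider in order, retrying transient failures before falling back."""
    last_err = None
    for provider in providers:
        for _ in range(retries):
            try:
                return provider(request)
            except RuntimeError as err:   # stand-in for a transient provider error
                last_err = err            # remember failure, try again or fall back
    raise last_err                        # every provider exhausted
```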
Data Flow
1. Client sends inference request to Gateway
2. Gateway validates and preprocesses request
3. Routing Manager selects optimal provider
4. Adapter translates request to provider-specific format
5. Provider processes inference
6. Response is logged to PostgreSQL
7. Response is returned to client
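The steps above can be condensed into a single hypothetical handler; `router`, the adapter hooks, and the in-memory `log` are illustrative stand-ins for the Routing Manager, Adapters, and PostgreSQL storage:

```python
def handle_request(req, router, adapters, log):
    """Illustrative end-to-end data flow through the gateway."""
    if not req.get("model"):                        # validate / preprocess
        raise ValueError("model is required")
    provider = router(req)                          # routing manager picks a provider
    payload = adapters[provider]["encode"](req)     # adapter: provider-specific format
    raw = adapters[provider]["call"](payload)       # provider runs inference
    resp = adapters[provider]["decode"](raw)        # adapter: back to common format
    log.append({"req": req, "resp": resp})          # persist for observability
    return resp                                     # back to the client
```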