Architecture Overview

HybridInference is designed as a modular, high-performance inference gateway.

System Architecture

┌─────────────┐
│   Clients   │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│  FastAPI Gateway│
│   (serving/)    │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌─────────┐
│Routing │ │Adapters │
│Manager │ │         │
└────┬───┘ └────┬────┘
     │          │
     ▼          ▼
┌────────────────────────────┐
│       LLM Providers        │
│ ┌────────────────────────┐ │
│ │ vLLM / SGLang (local)  │ │
│ │ Ollama (local/remote)  │ │
│ │ OpenAI-compat APIs     │ │
│ │ (DeepSeek, Zhipu,      │ │
│ │  Chutes, Featherless)  │ │
│ │ OpenRouter             │ │
│ │ Gemini API             │ │
│ │ Anthropic API          │ │
│ └────────────────────────┘ │
└────────────────────────────┘

Network Layer

External traffic passes through three layers before reaching the application logic:

Client ──▶ Cloudflare (CDN + DDoS) ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                                    └──▶ Frontend (:3001)
  • Cloudflare: Edge CDN, DDoS protection, SSL termination (Full strict mode).

  • Nginx: Origin TLS, path-based routing, body size limits, WebSocket upgrade.

  • FastAPI: API authentication, model routing, observability.

The network layer handles external connectivity and request delivery. The sections below describe the internal inference pipeline that runs inside FastAPI.
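As a sketch of the Nginx layer's responsibilities, an origin configuration along these lines would cover TLS, path-based routing, body limits, and WebSocket upgrade. All paths, ports, and the domain below are illustrative, not the project's actual config:

```nginx
server {
    listen 443 ssl;
    server_name api.example.com;                   # placeholder domain

    ssl_certificate     /etc/ssl/origin.pem;       # Cloudflare origin cert (illustrative path)
    ssl_certificate_key /etc/ssl/origin-key.pem;

    client_max_body_size 10m;                      # request body size limit

    # Path-based routing: API traffic to FastAPI, everything else to the frontend
    location /v1/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        # WebSocket upgrade for streaming connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location / {
        proxy_pass http://127.0.0.1:3001;
        proxy_set_header Host $host;
    }
}
```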

Core Components

Serving Layer (serving/)

The serving layer provides the FastAPI-based gateway:

  • Gateway: HTTP API endpoints for inference requests

  • Adapters: Provider-specific API adapters

  • Observability: Logging, metrics, and tracing

  • Storage: PostgreSQL integration for request/response logging
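The adapter role can be sketched in a few lines of Python: each adapter translates the gateway's internal request shape into a provider's wire format. The class and field names below are illustrative, not the project's actual interfaces; real adapters would also handle auth headers, streaming, and error mapping:

```python
from dataclasses import dataclass


@dataclass
class InferenceRequest:
    """Gateway-internal request shape (hypothetical)."""
    model: str
    prompt: str
    max_tokens: int = 256


class OpenAICompatAdapter:
    """Maps an internal request to an OpenAI-compatible chat payload."""

    def to_provider_payload(self, req: InferenceRequest) -> dict:
        return {
            "model": req.model,
            "messages": [{"role": "user", "content": req.prompt}],
            "max_tokens": req.max_tokens,
        }


class OllamaAdapter:
    """Maps the same request to an Ollama /api/generate payload."""

    def to_provider_payload(self, req: InferenceRequest) -> dict:
        return {
            "model": req.model,
            "prompt": req.prompt,
            "options": {"num_predict": req.max_tokens},
        }
```

Because every adapter exposes the same `to_provider_payload` method, the routing layer can stay provider-agnostic.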

Routing Layer (routing/)

Intelligent routing and load balancing:

  • Manager: Routes requests to optimal providers

  • Strategies: Different routing algorithms (round-robin, cost-based, latency-based)

  • Health Checks: Monitor provider availability and performance
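As a sketch of how strategies stay pluggable, two of the algorithms above reduce to small classes sharing one `select` method (the names are hypothetical; cost-based selection is analogous, with per-token prices in place of latencies):

```python
import itertools


class RoundRobinStrategy:
    """Cycle through providers in a fixed order."""

    def __init__(self, providers):
        self._cycle = itertools.cycle(providers)

    def select(self, metrics=None):
        return next(self._cycle)


class LatencyStrategy:
    """Pick the provider with the lowest observed latency."""

    def select(self, metrics):
        # metrics: {provider_name: latency_seconds}, e.g. fed by health checks
        return min(metrics, key=metrics.get)
```

Health checks would keep the `metrics` mapping fresh and drop unavailable providers from the candidate set before `select` is called.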

Configuration (config/)

Centralized configuration management:

  • Model configurations

  • Provider settings

  • Routing policies

  • Feature flags
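A sketch of what centralized configuration could look like as Python dataclasses; the field names are assumptions rather than the project's actual schema:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    base_url: str
    api_key_env: str          # env var holding the key, never the key itself


@dataclass(frozen=True)
class ModelConfig:
    name: str
    providers: tuple          # provider names, in fallback order
    max_tokens: int = 4096


@dataclass
class GatewayConfig:
    """Single object holding all runtime configuration."""
    providers: dict = field(default_factory=dict)      # name -> ProviderConfig
    models: dict = field(default_factory=dict)         # name -> ModelConfig
    feature_flags: dict = field(default_factory=dict)  # flag name -> bool
```

Freezing the per-provider and per-model records keeps configuration immutable after load, while the top-level object can be swapped atomically on reload.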

Deploy (deploy/)

Deployment and observability:

  • Docker Compose service definitions and Dockerfiles

  • Alertmanager and alert logger

Key Design Principles

  1. Modularity: Clear separation between serving, routing, and provider layers

  2. Extensibility: Easy to add new providers and routing strategies

  3. Observability: Comprehensive logging and metrics at every layer

  4. Performance: Optimized for low-latency, high-throughput inference

  5. Reliability: Health checks, retries, and fallback mechanisms
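Principle 5 can be illustrated with a minimal retry-then-fallback loop. This is a sketch: a real implementation would distinguish transient from permanent errors, apply backoff, and emit metrics on each failure:

```python
def call_with_fallback(providers, request, retries=2):
    """Try each provider in order, retrying transient failures.

    `providers` is a list of callables that raise on failure (illustrative).
    """
    last_err = None
    for provider in providers:
        for _ in range(retries + 1):
            try:
                return provider(request)
            except Exception as err:   # in practice: catch only transient errors
                last_err = err
    raise RuntimeError("all providers failed") from last_err
```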

Data Flow

  1. Client sends inference request to Gateway

  2. Gateway validates and preprocesses request

  3. Routing Manager selects optimal provider

  4. Adapter translates request to provider-specific format

  5. Provider processes inference

  6. Response is logged to PostgreSQL

  7. Response is returned to client
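The seven steps above can be condensed into a single handler sketch. Every name here (`router`, `adapters`, `log_store`, `infer`) is hypothetical, and the in-memory list stands in for the PostgreSQL write:

```python
def handle_request(raw, router, adapters, log_store):
    # 1-2. Validate and preprocess the incoming request
    if not raw.get("prompt"):
        raise ValueError("prompt is required")
    request = {"model": raw.get("model", "default"), "prompt": raw["prompt"].strip()}

    # 3. Routing Manager selects a provider
    provider = router(request)

    # 4. Adapter translates to the provider-specific format
    payload = adapters[provider.name](request)

    # 5. Provider runs inference
    response = provider.infer(payload)

    # 6. Log request/response (stands in for the PostgreSQL write)
    log_store.append({"request": request, "response": response})

    # 7. Return to the client
    return response
```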