# Architecture Overview

HybridInference is designed as a modular, high-performance inference gateway.

## System Architecture

```
┌─────────────┐
│   Clients   │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│  FastAPI Gateway│
│   (serving/)    │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌─────────┐
│Routing │ │Adapters │
│Manager │ │         │
└────┬───┘ └────┬────┘
     │          │
     ▼          ▼
┌───────────────────────┐
│    LLM Providers      │
│ ┌────────────────────────┐ │
│ │ vLLM / SGLang (local)  │ │
│ │ Ollama (local/remote)  │ │
│ │ OpenAI-compat APIs     │ │
│ │ (DeepSeek, Zhipu,      │ │
│ │  Chutes, Featherless)  │ │
│ │ OpenRouter             │ │
│ │ Gemini API             │ │
│ │ Anthropic API          │ │
│ └────────────────────────┘ │
└───────────────────────┘
```

## Network Layer

External traffic passes through three layers before reaching the application logic:

```
Client ──▶ Cloudflare (CDN + DDoS) ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                                    └──▶ Frontend (:3001)
```

- **Cloudflare**: Edge CDN, DDoS protection, SSL termination (Full strict mode).
- **Nginx**: Origin TLS, path-based routing, body size limits, WebSocket upgrade.
- **FastAPI**: API authentication, model routing, observability.

The network layer handles external connectivity and request delivery. The sections below describe the internal inference pipeline that runs inside FastAPI.

## Core Components

### Serving Layer (`serving/`)

The serving layer provides the FastAPI-based gateway:

- **Gateway**: HTTP API endpoints for inference requests
- **Adapters**: Provider-specific API adapters
- **Observability**: Logging, metrics, and tracing
- **Storage**: PostgreSQL integration for request/response logging

### Routing Layer (`routing/`)

Intelligent routing and load balancing:

- **Manager**: Routes requests to optimal providers
- **Strategies**: Different routing algorithms (round-robin, cost-based, latency-based)
- **Health Checks**: Monitor provider availability and performance

### Configuration (`config/`)

Centralized configuration management:

- Model configurations
- Provider settings
- Routing policies
- Feature flags

### Deploy (`deploy/`)

Deployment and observability:

- Docker Compose service definitions and Dockerfiles
- Alertmanager and alert logger

## Key Design Principles

1. **Modularity**: Clear separation between serving, routing, and provider layers
2. **Extensibility**: Easy to add new providers and routing strategies
3. **Observability**: Comprehensive logging and metrics at every layer
4. **Performance**: Optimized for low-latency, high-throughput inference
5. **Reliability**: Health checks, retries, and fallback mechanisms

## Data Flow

1. Client sends inference request to Gateway
2. Gateway validates and preprocesses request
3. Routing Manager selects optimal provider
4. Adapter translates request to provider-specific format
5. Provider processes inference
6. Response is logged to PostgreSQL
7. Response is returned to client