# Configuration Guide
The system uses two configuration files for runtime setup:
- `config/models.yaml` (required): Registers available models and their backend adapters
- `config/routing.yaml` (optional): Configures traffic distribution between local and remote deployments
These files are independent: models.yaml provides the candidate set of adapters, while routing.yaml adjusts weights on top of registered adapters. Without routing.yaml, the system uses default weights from models.yaml or environment variables (typically 1.0).
## 1. Environment Variables and Priority
### Variable Substitution
The system supports environment variable placeholders in YAML:
- `${VAR}`: Reads `VAR` from the environment
- `${VAR:-default}`: Reads `VAR` from the environment, falling back to `default` if it is not set
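For example, the hybrid-deployment config in section 3 uses the fallback form so that startup does not depend on the variable being set:

```yaml
# Resolves to the value of LOCAL_BASE_URL if it is set,
# otherwise to http://localhost:8000.
endpoint: ${LOCAL_BASE_URL:-http://localhost:8000}
```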
### Configuration Priority
Environment variables (.env or system) > routing.yaml/models.yaml > code defaults
## 2. models.yaml (Required)
Registers models and adapters at startup. Each entry describes a model identifier and its backend configuration (base_url, api_key, capabilities, aliases, etc.).
Example:
```yaml
# config/models.yaml
models:
  - id: my-local-model
    name: My Local Model
    provider: ollama
    provider_model_id: "my-local-model"
    context_length: 65536
    max_output_length: 8192
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: ollama
        weight: 1.0
        base_url: "http://localhost:11434/v1"
        provider_model_id: "my-local-model"

  - id: my-remote-model
    name: My Remote Model
    provider: openai_compat
    provider_model_id: "actual-model-id"
    context_length: 131072
    max_output_length: 8192
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "1.0"
      completion: "3.0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: openai_compat
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
        api_key: ${PROVIDER_API_KEY}
        provider_model_id: "actual-model-id"
```
Key Points:
- `id`: Public model ID exposed by the API (what clients use to call the model)
- `provider_model_id`: The actual model name sent to the backend provider (e.g., the name vLLM/freeinference lists under `/models/...`). If omitted, `id` is used.
- `aliases`: Additional public aliases registered alongside `id` to point to the same adapter
- `provider`: Determines the adapter type. Supported kinds (dispatched in `serving/servers/registry.py:_make_adapter`): `openai_compat`, `vllm`, `sglang`, `ollama`, `chutes`, `featherless`, `deepseek`, `zhipu`, `minimax`, `openrouter` (also `openrouter[<slug>]` to pin a sub-provider), `gemini`, `claude`, `anthropic`. See adding-models.md for the full reference table.
- The `/v1/models` endpoint dynamically generates its response from the registered adapters.
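Since `aliases` does not appear in the example above, here is a minimal sketch of an entry that uses it (the alias names are made up; field placement mirrors the example):

```yaml
models:
  - id: my-local-model
    aliases: [local-default, my-model-v1]   # hypothetical extra public names
    provider: ollama
    route:
      - kind: ollama
        weight: 1.0
        base_url: "http://localhost:11434/v1"
```

Clients could then call this model as `my-local-model`, `local-default`, or `my-model-v1` interchangeably, since all three names point to the same adapter.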
## 3. routing.yaml (Optional)
Controls traffic distribution between local and remote deployments with optional health monitoring.
### Fixed-Ratio Strategy
- Set `routing_strategy: fixed`
- Control the local traffic percentage via `routing_parameter.local_fraction` (0.0–1.0)
- Weights are distributed equally within the local and remote groups
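As a worked example of the equal in-group split (the adapter counts are assumed for illustration): with `local_fraction: 0.6`, two local adapters, and one remote adapter, each local adapter receives weight 0.6 / 2 = 0.3 and the remote adapter receives 0.4 / 1 = 0.4.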
### Health Checking (Optional)
- `health_check: N`: Sends a GET request to `/health` every N seconds
- Unhealthy endpoints temporarily get weight 0
- Endpoints recover automatically when health checks succeed
- Set to 0 or omit the field to disable health checking
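Roughly, each probe is equivalent to the following (a sketch; the endpoint URL comes from your deployment config, and the timeout value here is illustrative):

```bash
# A non-2xx response or a timeout marks the endpoint unhealthy
# (weight 0) until a later probe succeeds.
curl --max-time 2 http://localhost:8000/health
```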
### Example: Hybrid Deployment (60% local / 40% remote)
```yaml
# config/routing.yaml
routing_strategy: fixed
routing_parameter:
  local_fraction: 0.6
timeout: 2
health_check: 30
logging:
  output: output.log
local_deployment:
  - endpoint: ${LOCAL_BASE_URL:-http://localhost:8000}
    models:
remote_deployment:
  models:
```
How It Works:
- At startup, `RoutingManager` applies the 60/40 weights to registered adapters
- If the local endpoint becomes unhealthy, weights adjust automatically (0% local, 100% remote)
- The system falls back gracefully to maintain service availability
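Putting the two mechanisms together, the weights in the 60/40 example would evolve roughly like this (an illustrative timeline, not logged output):

```
healthy:            local = 0.6, remote = 0.4
local probe fails:  local = 0.0, remote = 1.0
probe succeeds:     local = 0.6, remote = 0.4   (restored automatically)
```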
### Local-Only Deployment
Simply omit routing.yaml to use default weights from models.yaml (typically 1.0 for all adapters).
## 4. Running the System
Set Environment Variables:
```bash
export LOCAL_BASE_URL=http://localhost:8000
```
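Alternatively, since section 1 notes that `.env` values take the same priority as system environment variables, you can keep the variable in a `.env` file instead of exporting it per shell (a sketch, assuming the server loads `.env` at startup as described there):

```bash
# Append the variable to a .env file in the working directory.
echo "LOCAL_BASE_URL=http://localhost:8000" >> .env
```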
Start the Server:
```bash
# Use uvicorn (no __main__ block in serving.servers.app)
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
```
Verify Operation:
- `GET /v1/models`: Returns the available models
- `POST /v1/chat/completions`: Routes requests based on the configured ratios
- `GET /routing`: Shows the current routing configuration
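For a quick smoke test against the server started in the previous step (port 8080 and the model ID `my-local-model` are taken from the examples above):

```bash
curl http://localhost:8080/v1/models
curl http://localhost:8080/routing
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-local-model", "messages": [{"role": "user", "content": "Hello"}]}'
```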
## 5. FAQ
Q: What if routing.yaml conflicts with models.yaml?
A: routing.yaml only adjusts weights; it doesn’t add/remove adapters. The candidate set comes from models.yaml and environment variables.
Q: How to disable health checks?
A: Set health_check: 0 or omit the field entirely.
Q: Can I use other routing strategies?
A: Two routing layers exist, with different scopes:
- The deployment-wide weight strategy in `config/routing.yaml` (`routing_strategy:`) currently only supports `fixed`; `RoutingManager.apply()` returns without applying weights for any other value (see `routing/manager.py`).
- A per-model `routewise` strategy is also built in. Opt in by setting `routing_strategy: routewise` on a model entry in `config/models.yaml`; tuning parameters live in `config/routewise.yaml`. RouteWise provides cost-aware routing via a primal-dual decision algorithm.
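For example, the per-model opt-in might look like this (a sketch: only the `routing_strategy` key is prescribed above; the remaining fields mirror the earlier models.yaml example):

```yaml
# config/models.yaml (sketch)
models:
  - id: my-remote-model
    routing_strategy: routewise   # tuning parameters live in config/routewise.yaml
    provider: openai_compat
    route:
      - kind: openai_compat
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
```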
To add new deployment-wide strategies, implement them in routing/strategies.py and extend the dispatch in routing/manager.py.
Q: What happens during failover? A: The system automatically tries alternative adapters when the primary fails, ensuring continuous service availability.