# Adding a New Local Model

This guide explains how to register a self-hosted model behind the HybridInference gateway. Use this when the model is already served by a local OpenAI-compatible server such as vLLM, SGLang, Ollama, or a custom `/v1/chat/completions` service. For new remote providers or custom adapters, see [Adding a New Model](adding-models.md).

## Overview

Adding a local model has three parts:

1. Start the local inference server.
2. Add a `config/models.yaml` entry that points to that server.
3. Restart HybridInference and verify the public `/v1` API.

The local server must expose OpenAI-compatible endpoints. The gateway forwards chat requests to `/v1/chat/completions` and embedding requests to `/v1/embeddings` when the model is registered as an embedding model.

## Private Server (No Public Internet)

If your model runs on a different server that is not exposed to the public internet, keep it private and make the gateway reach it over trusted network paths.

Recommended options:

Point the route directly at a private IP or internal hostname that the gateway host can reach:

```yaml
route:
  - kind: openai_compat
    weight: 1.0
    base_url: "http://10.0.12.34:8000/v1"
    provider_model_id: "your-served-model-name"
```

Or use an SSH reverse tunnel (internal host -> FreeInference host):

```bash
# Run this on the INTERNAL model host
ssh -N -R 8001:127.0.0.1:8000 jason@freeinference.org
```

Then set:

```yaml
base_url: "http://127.0.0.1:8001/v1"  # resolved on the FreeInference host
```

For reverse-tunnel setups, verify from the FreeInference side:

```bash
curl http://127.0.0.1:8001/v1/models | jq
```

## Step 1: Start the Local Model Server

Start the model with your preferred serving runtime. Example with vLLM:

```bash
vllm serve Qwen/Qwen3.5-27B \
  --host 0.0.0.0 \
  --port 8007 \
  --served-model-name Qwen3.5-27B
```

Check that the local server responds before changing the gateway config:

```bash
curl http://localhost:8007/v1/models | jq

curl -s -X POST http://localhost:8007/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }' | jq
```

If HybridInference runs in Docker, use `http://host.docker.internal:<port>` in `config/models.yaml` so the container can reach the host. If it runs directly on the host, `http://localhost:<port>` is fine.

## Step 2: Add the Model to `config/models.yaml`

Add a new entry under `models:`. Keep the public `id` short and stable because clients use it in the `model` field.

```yaml
- id: qwen3.5-27b
  name: Qwen3.5 27B
  provider: sglang
  quantization: "unknown"
  input_modalities: ["text"]
  output_modalities: ["text"]
  context_length: 65536
  max_output_length: 8192
  supports_tools: true
  supports_structured_output: true
  supported_params: [temperature, top_p, max_tokens, stop, stream]
  aliases: ["Qwen3.5-27B"]
  pricing:
    prompt: "0"
    completion: "0"
    image: "0"
    request: "0"
    input_cache_reads: "0"
    input_cache_writes: "0"
  route:
    - kind: sglang
      weight: 1.0
      base_url: "http://host.docker.internal:8007"
      provider_model_id: "Qwen3.5-27B"
      pricing:
        prompt: "0"
        completion: "0"
```

Use these fields carefully:

- `id`: Public model ID returned by `/v1/models` and used by clients.
- `provider`: Top-level provider label for metadata. For local OpenAI-compatible servers, use `vllm`, `sglang`, or `openai_compat`.
- `route[].kind`: Adapter kind used by the gateway. Local OpenAI-compatible services can use `vllm`, `sglang`, or `openai_compat`.
- `base_url`: The local server root. It may include `/v1`, but does not have to.
- `provider_model_id`: Model name sent to the local server. This must match the serving runtime's model name.
- `aliases`: Optional extra public names that resolve to the same gateway model.
- `supported_params`: Only include parameters that the local runtime accepts.
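Before restarting the gateway, it can help to confirm that the edited file still parses and that the new entry is visible. The following is a minimal sketch, assuming PyYAML is installed in your Python environment and that `config/models.yaml` keeps the top-level `models:` list described above:

```bash
# Pre-flight check: parse config/models.yaml and list the public model IDs.
# Run from the repository root (adjust the path otherwise).
python - <<'EOF'
import yaml

with open("config/models.yaml") as f:
    cfg = yaml.safe_load(f)

# Print every public id so you can confirm the new entry (e.g. qwen3.5-27b) is present.
for model in cfg.get("models", []):
    print(model.get("id"))
EOF
```

If the file fails to parse here, fix the indentation before moving on; broken YAML is one of the reasons a new model never shows up in `/v1/models` (see Troubleshooting below).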
## Step 3: Add Optional Remote Fallbacks

If you want automatic fallback, add another route with a lower or equal weight:

```yaml
route:
  - kind: sglang
    weight: 1.0
    base_url: "http://host.docker.internal:8007"
    provider_model_id: "Qwen3.5-27B"
    pricing:
      prompt: "0"
      completion: "0"
  - kind: featherless
    weight: 0
    base_url: ${FEATHERLESS_BASE_URL}
    api_key: ${FEATHERLESS_API_KEY}
    provider_model_id: "Qwen/Qwen3.5-27B"
    pricing:
      prompt: "0"
      completion: "0"
```

Set the fallback `weight` to `0` when you want to keep the route configured but disabled. Set it above `0` to allow weighted routing and failover.

## Step 4: Restart the Gateway

Restart the backend so it reloads `config/models.yaml`:

```bash
make restart s=backend
```

For local development without Docker:

```bash
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
```

## Step 5: Verify Through HybridInference

List registered models:

```bash
curl http://localhost:8080/v1/models | jq
```

Run a chat completion through the gateway:

```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Hello from the gateway"}],
    "max_tokens": 32
  }' | jq
```

Test streaming:

```bash
curl -N -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Stream one sentence"}],
    "stream": true,
    "max_tokens": 64
  }'
```

## Routing Notes

If `config/routing.yaml` is present, it can adjust route weights after models are registered. Without `routing.yaml`, the gateway uses the weights in `config/models.yaml`.

When `OFFLOAD=1`, startup removes local adapters whose `base_url` matches `LOCAL_BASE_URL`. Use this for remote-only incidents, and make sure any local route you want offloaded uses the same base URL value as `LOCAL_BASE_URL`.

## Troubleshooting

### Model Does Not Appear in `/v1/models`

- Check YAML indentation under `models:`.
- Restart the backend after editing `config/models.yaml`.
- Confirm `id` and `aliases` do not collide with another model.
- Check backend logs for model registry errors.

### Gateway Cannot Reach the Local Server

- From Docker, use `host.docker.internal` instead of `localhost`.
- From bare metal, use `localhost` or the host IP.
- For private remote servers, use a private IP/hostname or a private tunnel endpoint; avoid public internet exposure.
- Confirm the local server listens on `0.0.0.0`, not only `127.0.0.1`, if it must be reached from a container.
- Verify that `curl /v1/models` works from the same environment as the backend.

### Requests Fail After Registration

- Make sure `provider_model_id` matches the model name exposed by the local runtime (see the check below).
- Remove unsupported request params from `supported_params`.
- If the runtime's base URL already ends in `/v1`, keep it that way; the adapter will use `/chat/completions` under that base.
- For tool calls or JSON output, set `supports_tools` and `supports_structured_output` only when the local runtime supports them.
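A quick way to catch a name mismatch is to compare what the local runtime serves with what the gateway exposes. This is a minimal sketch using the example ports from this guide (8007 for the local runtime, 8080 for the gateway); adjust them to your deployment:

```bash
# Names the local runtime actually serves; provider_model_id must match one of these.
curl -s http://localhost:8007/v1/models | jq -r '.data[].id'

# Public IDs the gateway exposes; clients put one of these in the "model" field.
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```

If the first list does not contain the `provider_model_id` from your route entry, fix the route (or the runtime's `--served-model-name`) rather than the public `id`.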