Deployment Guide
Guide for deploying HybridInference in production.
Production Deployment
Using systemd
The recommended way to deploy HybridInference is using systemd for both the backend API and the frontend app.
Backend API service
Install dependencies:
cd hybridInference
uv venv -p 3.10
source .venv/bin/activate
uv sync
Create systemd unit file:
sudo cp infrastructure/systemd/hybrid_inference.service /etc/systemd/system/
Configure environment:
Edit /etc/systemd/system/hybrid_inference.service and update:
WorkingDirectory
User
Environment variables
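For reference, a typical unit might look like the sketch below. The user, paths, and ExecStart command here are placeholders, not the shipped values — keep whatever infrastructure/systemd/hybrid_inference.service actually contains and only adjust the fields listed above.

[Unit]
Description=HybridInference backend API
After=network.target postgresql.service

[Service]
# Placeholder user and paths; match them to your install.
User=hybridinference
WorkingDirectory=/opt/hybridInference
EnvironmentFile=/opt/hybridInference/.env
# Placeholder command; use the ExecStart from the shipped unit file.
ExecStart=/opt/hybridInference/.venv/bin/python -m hybrid_inference
Restart=on-failure

[Install]
WantedBy=multi-user.target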
Start the service:
sudo systemctl daemon-reload
sudo systemctl enable hybrid_inference.service
sudo systemctl start hybrid_inference.service
Check status:
sudo systemctl status hybrid_inference.service
journalctl -u hybrid_inference.service -f
Frontend service (Next.js)
For the FreeInference web UI, we recommend running the Next.js frontend as a separate
systemd service on port 3001 and putting Nginx in front of it.
Build the production frontend:
cd hybridInference/frontend
npm install
npm run build
Install the systemd unit:
sudo cp infrastructure/systemd/freeinference-frontend.service /etc/systemd/system/
Edit the unit if needed:
Update /etc/systemd/system/freeinference-frontend.service:
User and Group
WorkingDirectory
ExecStart (Node.js path), if your Node.js binary is not in the default location
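As an illustration only, the [Service] section of such a unit could look like this; the user, directory, and start command are assumptions — the shipped unit is authoritative, and Next.js can equally be started via node directly:

[Service]
# Placeholder values; keep those from the shipped unit where they differ.
User=www-data
Group=www-data
WorkingDirectory=/opt/hybridInference/frontend
# Runs the production server built by `npm run build` on port 3001.
ExecStart=/usr/bin/npm run start -- --port 3001
Restart=on-failure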
Enable and start the frontend:
sudo systemctl daemon-reload
sudo systemctl enable freeinference-frontend.service
sudo systemctl start freeinference-frontend.service
sudo systemctl status freeinference-frontend.service
With this setup, the backend API listens on 127.0.0.1:8080 and the frontend on
127.0.0.1:3001. The next section shows how to expose both securely via Nginx and HTTPS.
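Before adding Nginx, you can confirm both services respond locally; each command should print 200:

curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/health
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:3001/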
Environment Variables
Required environment variables for production:
# API Keys
DEEPSEEK_API_KEY=your-key
GEMINI_API_KEY=your-key
LLAMA_API_KEY=your-key
# Database
DB_NAME=hybridinference
DB_USER=postgres
DB_PASSWORD=your-secure-password
DB_HOST=localhost
DB_PORT=5432
# Local vLLM (optional)
LOCAL_BASE_URL=http://localhost:8000/v1
# Rate limiting (optional)
RATE_LIMIT_PER_MINUTE=100
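Note that systemd does not read a project .env file automatically. One common approach (an option, not necessarily what the shipped unit does) is to copy the variables into a root-only file and point the unit's [Service] section at it with EnvironmentFile=/etc/hybrid_inference/env:

# Secrets live here, so keep the file readable by root only (mode 600).
sudo install -m 600 -D .env /etc/hybrid_inference/env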
Nginx and HTTPS (optional but recommended)
In production we recommend putting Nginx in front of the backend and frontend to:
Terminate TLS (HTTPS).
Serve the frontend on the root path (/).
Route API traffic to the backend.
An example configuration is provided in infrastructure/nginx/freeinference.conf. Typical
deployment steps on Ubuntu/Debian:
sudo cp infrastructure/nginx/freeinference.conf /etc/nginx/sites-available/freeinference.conf
sudo ln -s /etc/nginx/sites-available/freeinference.conf /etc/nginx/sites-enabled/freeinference.conf
sudo nginx -t
sudo systemctl reload nginx
This configuration assumes:
Backend API: 127.0.0.1:8080
Frontend Next.js: 127.0.0.1:3001
Public domain: freeinference.org
HTTPS certificates from Let’s Encrypt (see comments in the config file).
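The shipped freeinference.conf is the source of truth; purely as an illustration of the assumptions above, a minimal server block might look like the following. The /api/ prefix and certificate paths are placeholders — check the real routes in the config file.

server {
    listen 443 ssl;
    server_name freeinference.org;

    ssl_certificate     /etc/letsencrypt/live/freeinference.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/freeinference.org/privkey.pem;

    # Frontend (Next.js) on the root path.
    location / {
        proxy_pass http://127.0.0.1:3001;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Backend API; the /api/ prefix is a placeholder route.
    location /api/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}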
If you run behind Cloudflare, set SSL/TLS mode to Full (strict) so that Cloudflare connects to Nginx over HTTPS and avoids redirect loops on port 80.
Health Checks
Monitor service health:
curl http://localhost:8080/health
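The same endpoint is convenient for unattended checks, e.g. from cron; a minimal sketch (the exact response body depends on the API):

# -f makes curl exit non-zero on HTTP errors, so failures are easy to alert on.
curl -fsS --max-time 5 http://localhost:8080/health > /dev/null || echo "hybrid_inference health check failed" >&2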
Logs
View logs:
# Follow logs
journalctl -u hybrid_inference.service -f
# View recent logs
journalctl -u hybrid_inference.service -n 100
Monitoring
Prometheus Metrics
Metrics are exposed at /metrics:
curl http://localhost:8080/metrics
Key metrics:
http_requests_total - Total HTTP requests
http_request_duration_seconds - Request latency
model_requests_total - Requests per model
model_errors_total - Errors per model
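To collect these with Prometheus, a minimal scrape job might look like the following; the target assumes the backend on 127.0.0.1:8080 as configured above:

scrape_configs:
  - job_name: hybrid_inference
    metrics_path: /metrics
    static_configs:
      - targets: ["127.0.0.1:8080"]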
Grafana Dashboards
Import the dashboard from infrastructure/grafana/.
Database Setup
PostgreSQL
Create database:
CREATE DATABASE hybridinference;
CREATE USER hybridinference WITH PASSWORD 'your-password';
GRANT ALL PRIVILEGES ON DATABASE hybridinference TO hybridinference;
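On PostgreSQL 15 and later, non-owner roles can no longer create tables in the public schema by default, so you may also need:

-- Run while connected to the hybridinference database.
GRANT ALL ON SCHEMA public TO hybridinference;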
Configure connection:
Update .env with database credentials.
See the Database guide for details.
Troubleshooting
Service won’t start
Check logs:
journalctl -u hybrid_inference.service -n 50
Common issues:
Missing API keys
Database connection failed
Port already in use
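For the last of these, you can see which process already holds the backend port (8080 here) with:

sudo ss -ltnp | grep ':8080'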
High latency
Check:
Database performance
Provider API latency
Resource usage (CPU/memory)
Rate limiting
Adjust rate limits in configuration:
rate_limits:
requests_per_minute: 100
tokens_per_minute: 100000
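After changing the limits, restart the backend so the new values take effect (assuming the service reads its configuration at startup):

sudo systemctl restart hybrid_inference.service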