Self-Hosted AI Stack: Ollama, Open WebUI, and LiteLLM in Production
Why Self-Host AI Inference
Running LLMs locally is no longer a hobbyist experiment. With models like DeepSeek V4, Qwen 3.6, Llama 4, and Mistral Small 3.1 fitting on consumer GPUs, self-hosted AI has become a viable alternative to API-based services for many workloads:
- Cost at scale: API costs grow linearly with tokens. Self-hosted inference is a fixed hardware cost.
- Data privacy: No data leaves your network — critical for regulated industries.
- Latency: Local inference is 10-50ms vs 500-2000ms for cloud APIs at the 7B-70B parameter range.
- Model flexibility: Run any open-weight model, fine-tune, swap, experiment without vendor lock-in.
Architecture

Core Components
Ollama — Local Inference Engine
Ollama has become the standard for running LLMs locally. It handles model downloading, GPU acceleration (NVIDIA CUDA), and exposes a REST API for inference.
Key decisions:
- Model storage: Models live at
~/.ollama/modelsby default. Map this to a persistent volume — downloading 70B models (40+ GB) repeatedly is painful. - GPU passthrough: NVIDIA container toolkit is required for GPU acceleration. Without it, CPU inference on a 70B model takes seconds per token.
- Concurrent requests: Ollama handles one request per model by default. For multi-user setups, increase the
--num-parallelflag or use LiteLLM's routing.
Model selection guide:
| Model | Parameters | VRAM | Quality | Speed (7B on 3090) |
|---|---|---|---|---|
| Qwen 3.6 | 7.6B | 6 GB | Good | ~80 tok/s |
| Mistral Small 3.1 | 24B | 16 GB | Very Good | ~40 tok/s |
| DeepSeek V4 | 67B | 42 GB | Excellent | ~15 tok/s |
| Llama 4 | 8B/70B | 6 GB / 42 GB | Good/Excellent | ~75 tok/s / ~12 tok/s |
LiteLLM — The API Gateway
LiteLLM provides an OpenAI-compatible API that routes to any LLM backend. This is the key architectural decision: instead of connecting Open WebUI directly to Ollama, route through LiteLLM.
Why this matters:
- Multi-model routing: LiteLLM can load-balance across models, fall back on errors, and route by cost or latency.
- OpenAI drop-in: Any tool that uses OpenAI's API (Cursor, Continue.dev, Open Interpreter) can point to your LiteLLM endpoint instead.
- Usage tracking: LiteLLM logs token usage per user, per model — essential for cost allocation in team setups.
- Rate limiting: Prevent a single user from saturating the GPU with bulk inference.
The config file (litellm_config.yaml) defines model groups and routing rules:
model_list:
- model_name: fast
litellm_params:
model: ollama/mistral-s3.1
api_base: http://ollama:11434
rpm: 60
- model_name: powerful
litellm_params:
model: ollama/deepseek-v4
api_base: http://ollama:11434
rpm: 10
- model_name: default
litellm_params:
model: ollama/qwen3.6
api_base: http://ollama:11434
Open WebUI — The User Interface
Open WebUI is the most feature-complete ChatGPT alternative for self-hosted setups. It provides:
- Conversation management: Threads, search history, export
- RAG support: Upload documents and query them with any model
- User management: Multi-user with role-based access
- Tool support: Web search, code execution, image generation (via ComfyUI integration)
The UI connects to LiteLLM's OpenAI-compatible endpoint, so it inherits all routing and fallback behavior automatically.
Hardware Requirements
| Setup | GPU | RAM (System) | Storage | Users |
|---|---|---|---|---|
| Minimum | None (CPU) | 16 GB | 20 GB | 1 |
| Recommended | RTX 3090 24GB | 32 GB | 100 GB | 1-3 |
| Team | 2× RTX 4090 24GB | 64 GB | 500 GB | 5-15 |
| Production | 4× A100 80GB | 256 GB | 2 TB | 50+ |
Without a GPU, expect 1-5 tokens/second on 7B models — usable for chat, painful for batch processing.
Production Considerations
Model Management
Don't pull every model at once. A single 70B model occupies ~40 GB of VRAM for inference plus ~40 GB of disk. Start with one small + one large model:
# Start lean
docker compose exec ollama ollama pull mistral-s3.1 # 16 GB disk, 24B params
docker compose exec ollama ollama pull qwen3.6 # 6 GB disk, 7.6B params
# Add when needed
# docker compose exec ollama ollama pull deepseek-v4 # 42 GB disk, 67B params
Security
The stack exposes a UI and an API. For production:
- Put it behind the SSL Reverse Proxy stack (HTTPS + WAF)
- Set
WEBUI_SECRET_KEYandWEBUI_JWT_SECRETin Open WebUI - Enable user authentication (default: on)
- Set
LITELLM_MASTER_KEYfor API access control - Restrict model access — not all users need 70B inference
Monitoring
LiteLLM exposes Prometheus metrics at /metrics. Add this endpoint to the Observability stack's Prometheus scrape config for:
- Token usage per model and per user
- Request latency by model
- Error rates and rate limit hits
- Model fallback frequency
Key Takeaways
- LiteLLM is the critical layer: It decouples the UI from the inference engine, enabling model routing, fallbacks, and rate limiting without touching either component
- GPU is non-negotiable for acceptable UX: CPU inference is feasible but frustrating for interactive use
- Start with Qwen 3.6 (7B) for speed and Mistral Small 3.1 (24B) for quality — upgrade to DeepSeek V4 when you need maximum capability
- The stack scales horizontally: Add more Ollama instances behind LiteLLM as demand grows