Self-Hosted AI Stack: Ollama, Open WebUI, and LiteLLM in Production | DevOps

Why Self-Host AI Inference

Running LLMs locally is no longer a hobbyist experiment. With models like DeepSeek V4, Qwen 3.6, Llama 4, and Mistral Small 3.1 fitting on consumer GPUs, self-hosted AI has become a viable alternative to API-based services for many workloads:

Cost at scale: API costs grow linearly with tokens. Self-hosted inference is a fixed hardware cost.
Data privacy: No data leaves your network — critical for regulated industries.
Latency: Local inference is 10-50ms vs 500-2000ms for cloud APIs at the 7B-70B parameter range.
Model flexibility: Run any open-weight model, fine-tune, swap, experiment without vendor lock-in.

Architecture

Core Components

Ollama — Local Inference Engine

Ollama has become the standard for running LLMs locally. It handles model downloading, GPU acceleration (NVIDIA CUDA), and exposes a REST API for inference.

Key decisions:

Model storage: Models live at ~/.ollama/models by default. Map this to a persistent volume — downloading 70B models (40+ GB) repeatedly is painful.
GPU passthrough: NVIDIA container toolkit is required for GPU acceleration. Without it, CPU inference on a 70B model takes seconds per token.
Concurrent requests: Ollama handles one request per model by default. For multi-user setups, increase the --num-parallel flag or use LiteLLM's routing.

Model selection guide:

Model	Parameters	VRAM	Quality	Speed (7B on 3090)
Qwen 3.6	7.6B	6 GB	Good	~80 tok/s
Mistral Small 3.1	24B	16 GB	Very Good	~40 tok/s
DeepSeek V4	67B	42 GB	Excellent	~15 tok/s
Llama 4	8B/70B	6 GB / 42 GB	Good/Excellent	~75 tok/s / ~12 tok/s

LiteLLM — The API Gateway

LiteLLM provides an OpenAI-compatible API that routes to any LLM backend. This is the key architectural decision: instead of connecting Open WebUI directly to Ollama, route through LiteLLM.

Why this matters:

Multi-model routing: LiteLLM can load-balance across models, fall back on errors, and route by cost or latency.
OpenAI drop-in: Any tool that uses OpenAI's API (Cursor, Continue.dev, Open Interpreter) can point to your LiteLLM endpoint instead.
Usage tracking: LiteLLM logs token usage per user, per model — essential for cost allocation in team setups.
Rate limiting: Prevent a single user from saturating the GPU with bulk inference.

The config file (litellm_config.yaml) defines model groups and routing rules:

model_list:
  - model_name: fast
    litellm_params:
      model: ollama/mistral-s3.1
      api_base: http://ollama:11434
      rpm: 60

  - model_name: powerful
    litellm_params:
      model: ollama/deepseek-v4
      api_base: http://ollama:11434
      rpm: 10

  - model_name: default
    litellm_params:
      model: ollama/qwen3.6
      api_base: http://ollama:11434

Open WebUI — The User Interface

Open WebUI is the most feature-complete ChatGPT alternative for self-hosted setups. It provides:

Conversation management: Threads, search history, export
RAG support: Upload documents and query them with any model
User management: Multi-user with role-based access
Tool support: Web search, code execution, image generation (via ComfyUI integration)

The UI connects to LiteLLM's OpenAI-compatible endpoint, so it inherits all routing and fallback behavior automatically.

Hardware Requirements

Setup	GPU	RAM (System)	Storage	Users
Minimum	None (CPU)	16 GB	20 GB	1
Recommended	RTX 3090 24GB	32 GB	100 GB	1-3
Team	2× RTX 4090 24GB	64 GB	500 GB	5-15
Production	4× A100 80GB	256 GB	2 TB	50+

Without a GPU, expect 1-5 tokens/second on 7B models — usable for chat, painful for batch processing.

Production Considerations

Model Management

Don't pull every model at once. A single 70B model occupies ~40 GB of VRAM for inference plus ~40 GB of disk. Start with one small + one large model:

# Start lean
docker compose exec ollama ollama pull mistral-s3.1  # 16 GB disk, 24B params
docker compose exec ollama ollama pull qwen3.6        # 6 GB disk, 7.6B params

# Add when needed
# docker compose exec ollama ollama pull deepseek-v4   # 42 GB disk, 67B params

Security

The stack exposes a UI and an API. For production:

Put it behind the SSL Reverse Proxy stack (HTTPS + WAF)
Set WEBUI_SECRET_KEY and WEBUI_JWT_SECRET in Open WebUI
Enable user authentication (default: on)
Set LITELLM_MASTER_KEY for API access control
Restrict model access — not all users need 70B inference

Monitoring

LiteLLM exposes Prometheus metrics at /metrics. Add this endpoint to the Observability stack's Prometheus scrape config for:

Token usage per model and per user
Request latency by model
Error rates and rate limit hits
Model fallback frequency

Key Takeaways

LiteLLM is the critical layer: It decouples the UI from the inference engine, enabling model routing, fallbacks, and rate limiting without touching either component
GPU is non-negotiable for acceptable UX: CPU inference is feasible but frustrating for interactive use
Start with Qwen 3.6 (7B) for speed and Mistral Small 3.1 (24B) for quality — upgrade to DeepSeek V4 when you need maximum capability
The stack scales horizontally: Add more Ollama instances behind LiteLLM as demand grows