# Build Your Own AI Infrastructure: Docker + Traefik for Self-Hosted LLMs
## Executive Summary
Enterprises are increasingly reconsidering their cloud AI strategy due to escalating costs, data privacy concerns, and regulatory compliance requirements. Building your own AI infrastructure with Docker, Traefik, and self-hosted Large Language Models (LLMs) offers a viable path to digital sovereignty while maintaining enterprise-grade performance and scalability. This guide presents a strategic framework for deploying self-hosted AI workloads with cost savings of 60-80% compared to commercial SaaS solutions, full control over your data, and the flexibility to scale according to organizational needs.
## The Challenge
### The Explosion of AI SaaS Costs
Enterprise adoption of commercial LLM services has grown rapidly, with some organizations spending $50,000-$500,000 per month on AI API calls alone. These services are convenient, but their costs are:
- Unpredictable: Usage-based pricing makes budgeting difficult
- Open-ended: Spend grows with consumption, with no built-in ceiling
- Vendor-locked: Difficult to migrate or negotiate better terms after adoption
### Data Sovereignty and Compliance Concerns
Sending sensitive data to third-party AI services creates significant risks:
- GDPR exposure: Personal data may be processed outside the EU
- IP leakage: Proprietary data may be used to train publicly available models
- Audit Trail Gaps: Limited visibility into how AI models process your data
- Regulatory Compliance: Industries like healthcare and finance have strict data residency requirements
### Infrastructure Complexity
Enterprise IT teams face significant hurdles when moving to self-hosted AI:
- Resource Management: GPU requirements, memory optimization, scaling challenges
- Orchestration: Managing multiple AI services, load balancing, failover
- Security: Authentication, network isolation, vulnerability scanning
- Observability: Monitoring performance, tracking token usage, debugging model behavior
## The Solution
### Strategic Approach: Container-Based AI Infrastructure
By leveraging Docker and Traefik, organizations can build a modular, scalable AI infrastructure that:
- Simplifies deployment: One-command container launches for new AI services
- Enables portability: Run anywhere—on-premise, cloud, or hybrid environments
- Provides resilience: Automatic failover, load balancing, and health checks
- Scales efficiently: Auto-scale based on demand while maintaining cost controls
### Architecture Overview
Figure: Traefik ingress layer routing to containerized LLM services with GPU scheduling.

**Key Architectural Benefits:**
- Traefik as Unified Ingress
  - Dynamic Service Discovery: Automatically routes to new services without manual configuration
  - SSL/TLS Automation: Let's Encrypt integration with zero manual certificate management
  - Circuit Breaking: Prevents cascade failures by detecting unresponsive services
  - Rate Limiting: Protect services from abuse and control API costs
- Docker Containerization
  - Isolation: Each AI service runs in an isolated environment
  - Resource Control: CPU, memory, and GPU allocation per container
  - Version Management: Easy rollback and A/B testing of different model versions
  - Multi-Model Support: Run multiple LLMs (Llama, Mistral, Falcon, etc.) simultaneously
- Horizontal Scaling
  - Load Balancing: Distribute requests across multiple instances
  - Auto-scaling: Scale based on CPU/GPU utilization, request latency, or queue depth
  - Geographic Distribution: Deploy models closer to users for lower latency
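To make the rate-limiting idea concrete, here is a minimal token-bucket sketch in Python. It is a conceptual illustration only, not Traefik's implementation; the class name `TokenBucket` and its `rate` and `burst` parameters are assumptions chosen for this example.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, now: Optional[float] = None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if a request may pass, consuming one token."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket created with `TokenBucket(rate=2, burst=3)` lets three requests through immediately, then admits roughly two per second. The optional `now` argument exists only so the behavior can be checked deterministically.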
### Business Impact
| Metric | Commercial SaaS | Self-Hosted Infrastructure | Savings |
|---|---|---|---|
| Monthly Cost (10M tokens) | $30,000 | $8,000-$12,000 | 60-73% |
| Data Sovereignty | Limited | Full control | ⬆ |
| Regulatory Compliance | Challenge | Addressable | ⬆ |
| Custom Model Training | Expensive | Included | ⬆ |
| Resource Predictability | Variable | Fixed | ⬆ |
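The savings column follows from the two cost columns. The short Python sketch below reproduces that arithmetic using the figures from the table; the function name `savings_pct` is just an illustrative choice, and the inputs should be replaced with your own numbers.

```python
def savings_pct(saas_monthly: float, self_hosted_monthly: float) -> float:
    """Percentage saved by self-hosting, relative to the SaaS bill."""
    return (saas_monthly - self_hosted_monthly) / saas_monthly * 100


saas = 30_000                         # monthly SaaS cost from the table
self_hosted_range = (8_000, 12_000)   # monthly self-hosted range from the table

best = savings_pct(saas, self_hosted_range[0])
worst = savings_pct(saas, self_hosted_range[1])
print(f"Savings: {worst:.0f}% to {best:.0f}%")
```

The same function can be rerun with hardware amortization, power, and staffing folded into the self-hosted figure to get a fuller total-cost comparison.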
## Implementation Roadmap
### Phase 1: Foundation (Week 1-2)
**Infrastructure Setup:**
- Provision Hardware/Cloud Instances
  - GPU servers (NVIDIA A100, GeForce RTX 4090, or cloud equivalents)
  - Minimum 32GB RAM, 8+ vCPU, 1TB SSD storage per LLM instance
  - Network: 10Gbps recommended for low-latency inference
- Install Core Components
  ```bash
  # Install Docker (latest stable) using the convenience script
  curl -fsSL https://get.docker.com -o get-docker.sh
  sudo sh get-docker.sh

  # Install the standalone Docker Compose binary
  # (current release assets use lowercase names: docker-compose-linux-x86_64)
  sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
  sudo chmod +x /usr/local/bin/docker-compose
  ```
- Deploy Traefik
  ```yaml
  services:
    traefik:
      image: traefik:v3.1
      command:
        - "--api.insecure=true"   # unauthenticated dashboard on :8080; keep it off public networks
        - "--providers.docker=true"
        - "--providers.docker.exposedbydefault=false"   # route only containers labeled traefik.enable=true
        - "--entrypoints.web.address=:80"
        - "--entrypoints.websecure.address=:443"
        - "--certificatesresolvers.myresolver.acme.tlschallenge=true"
        - "--certificatesresolvers.myresolver.acme.email=admin@yourcompany.com"
        - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"   # persist issued certificates
      ports:
        - "80:80"
        - "443:443"
        - "8080:8080"
      volumes:
        - "/var/run/docker.sock:/var/run/docker.sock:ro"
        - "./letsencrypt:/letsencrypt"
      networks:
        - ai-network   # Traefik must share a network with the services it routes to

  networks:
    ai-network:   # declare once per Compose file
  ```
**Success Criteria:**
- [ ] Traefik dashboard accessible at http://your-server:8080
- [ ] SSL certificate auto-generation with Let's Encrypt working
- [ ] Basic container routing demonstrated
### Phase 2: LLM Deployment (Week 3-4)
**Model Selection:**
We recommend starting with open-source models optimized for various use cases:
| Use Case | Recommended Model | Hardware | Context Window | Parameters |
|---|---|---|---|---|
| General Purpose | Llama 3 70B | 4x A100 (4x RTX 4090 only with quantization) | 8K tokens | 70B |
| Chat & Conversations | Mistral 7B | 1x A100/RTX 4090 | 32K tokens | 7B |
| Code Generation | CodeLlama 34B | 2x A100 | 16K tokens | 34B |
| Multi-language | Qwen 72B | 4x A100 | 32K tokens | 72B |
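**Deployment Using Docker:**
```yaml
services:
  llama3-70b:
    # vLLM's OpenAI-compatible server; the image entrypoint takes the flags under `command`
    image: vllm/vllm-openai:latest
    container_name: llama3-70b
    command:
      - "--model"
      - "meta-llama/Meta-Llama-3-70B"
      - "--tensor-parallel-size"
      - "4"   # shard the model across 4 GPUs, matching the hardware table
    ports:
      - "11434:8000"   # optional once Traefik fronts the service
    environment:
      # Llama 3 weights are gated on Hugging Face; supply a token that has access.
      # Sampling settings (max_tokens, temperature, top_p) are sent per request,
      # not as container environment variables.
      - HF_TOKEN=${HF_TOKEN}
    ipc: host   # shared memory for multi-GPU tensor parallelism
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./models:/root/.cache/huggingface   # persist downloaded weights
      - ./data:/data
    networks:
      - ai-network
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.llama3.rule=Host(`llama3.yourdomain.com`)"
      - "traefik.http.services.llama3.loadbalancer.server.port=8000"

networks:
  ai-network:   # declare once per Compose file
```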
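When several models are deployed this way, the three Traefik labels repeat with only the router name, hostname, and port changing. A small Python helper can generate them; this is a sketch, and the function name `traefik_labels` as well as the `mistral` example values are hypothetical.

```python
from typing import List


def traefik_labels(router: str, host: str, port: int) -> List[str]:
    """Build the Docker labels Traefik reads to route `host` to a container port."""
    return [
        "traefik.enable=true",
        f"traefik.http.routers.{router}.rule=Host(`{host}`)",
        f"traefik.http.services.{router}.loadbalancer.server.port={port}",
    ]


# Example: labels for a second, hypothetical model service.
for label in traefik_labels("mistral", "mistral.yourdomain.com", 8000):
    print(f'- "{label}"')
```

The printed lines can be pasted under the `labels:` key of a Compose service. Keeping the router and service names identical, as here, mirrors the naming used for the `llama3` service.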
**API Testing:**
```bash
# The model name must match the served model exactly, including case
curl http://llama3.yourdomain.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B",
    "prompt": "Explain the benefits of digital sovereignty",
    "max_tokens": 200
  }'
```
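The same request can be issued from application code. Below is a minimal Python sketch using only the standard library; it assumes the service exposes an OpenAI-style `/v1/completions` endpoint that returns generated text under `choices[0].text`, and the helper names `build_completion_request` and `complete` are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 200) -> urllib.request.Request:
    """Prepare a POST to the completions endpoint without sending it."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str,
             max_tokens: int = 200, timeout: float = 60.0) -> str:
    """Send the request and return the first generated completion."""
    req = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape: {"choices": [{"text": ...}]}
    return body["choices"][0]["text"]
```

Calling `complete("http://llama3.yourdomain.com", "meta-llama/Meta-Llama-3-70B", "Explain the benefits of digital sovereignty")` would mirror the curl test. Separating request construction from sending keeps the payload easy to inspect and log.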
**Success Criteria:**
- [ ] LLM service responds to API requests within 2-5 seconds
- [ ] Traefik routes traffic correctly to LLM containers
- [ ] Load balancer distributes requests across multiple instances
- [ ] GPU utilization visible (40-60% target)
### Phase 3: Security & Compliance (Week 5-6)
**Authentication Layer:**
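```yaml
services:
  keycloak:
    image: bitnami/keycloak:24
    ports:
      - "8081:8080"   # host port 8081 avoids a clash with the Traefik dashboard on 8080
    environment:
      - KEYCLOAK_ADMIN_USER=admin
      - KEYCLOAK_ADMIN_PASSWORD=secure_password   # placeholder; inject a real secret at deploy time
      - KEYCLOAK_ADMIN_REALM=ai-platform
    volumes:
      - keycloak_data:/bitnami/keycloak
    networks:
      - ai-network

volumes:
  keycloak_data:   # named volume used above
```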
**Network Isolation:**
- **VLAN Segmentation**: Separate AI services into isolated network segments
- **Firewall Rules**: Restrict inbound/outbound traffic to minimum necessary ports
- **Service Mesh**: Implement mutual TLS (mTLS) for service-to-service communication
**Vulnerability Scanning:**
```bash
# Scan the LLM serving image for known vulnerabilities
trivy image vllm/vllm-openai:latest
# Limit the report to high and critical findings
trivy image --severity HIGH,CRITICAL vllm/vllm-openai:latest
```
### Phase 4: Observability & Monitoring (Week 7-8)
**Metrics Collection:**
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks:
      - ai-network
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # placeholder; change before exposing Grafana
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - ai-network

volumes:
  prometheus_data:   # named volumes used above
  grafana_data:
```
**Key Metrics to Track:**
1. **Request Latency**: P50, P95, P99 response times
2. **Throughput**: Requests per second, tokens per second
3. **Resource Utilization**: GPU memory, GPU compute, RAM
4. **Error Rate**: HTTP 500s, timeout failures, out-of-memory errors
5. **Token Costs**: Track token generation by service/end-user
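For readers less familiar with percentile metrics, the following Python sketch shows one common way to derive P50, P95, and P99 from raw latency samples, using the nearest-rank definition. In practice a monitoring stack would compute these for you; the function names and the synthetic sample data here are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize request latencies as P50, P95, and P99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


latencies_ms = list(range(1, 101))  # synthetic samples: 1 ms through 100 ms
print(latency_summary(latencies_ms))
```

P99 is the figure to watch for LLM services, because a few long generations can dominate user-perceived latency even when the median looks healthy.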
## Technical Implementation
### Hands-on Setup Guide
For step-by-step technical tutorials covering Docker installation, Traefik configuration, and individual LLM deployments, we recommend:
- [Docker Installation Guide](https://docs.docker.com/engine/install/)
- [Traefik Configuration](https://doc.traefik.io/traefik/providers/docker/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Ollama Documentation](https://ollama.com/docs)
These guides provide the specific installation commands and configuration details. This article focuses on:
- **Strategic Decision Framework**: When to self-host vs. use SaaS services
- **Business Case Analysis**: Total cost of ownership, ROI calculations
- **Enterprise Architecture Patterns**: Multi-tenant isolation, shard strategies, caching layers
- **Operational Best Practices**: Incident response, capacity planning, upgrade strategies
- **Integration Patterns**: Connecting self-hosted LLMs with existing enterprise systems
### Cost Optimization Strategies
### 1. Model Quantization
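- Reduce model memory footprint by 50-75% with minimal accuracy loss (8-bit or 4-bit weights instead of FP16)
- Example: Llama 3 70B quantized from FP16 to 4-bit needs roughly 4x less weight memory (about 140 GB down to about 35 GB); inference is often faster as well, though the gain depends on hardware and kernel support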
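The memory effect of quantization can be estimated with simple arithmetic: weight memory scales with the number of bits stored per parameter. The Python sketch below covers weights only and ignores the KV cache, activations, and quantization metadata, so treat the results as a lower bound; the function name `weight_memory_gb` is an illustrative choice.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
```

Going from 16-bit to 8-bit halves the footprint and 4-bit quarters it, which is where a 50-75% reduction comes from.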
### 2. Dynamic Scaling
- Scale to zero during off-hours to save compute costs
- Auto-scale based on request queue depth (target: <5 seconds wait time)
- Spot instances for development/testing (70% cost savings)
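One simple way to turn queue depth into a replica count is to ask how many instances are needed to drain the current queue within the target wait time. The Python sketch below is a rough heuristic under that assumption; `desired_replicas`, the default of 8 maximum replicas, and the example service time are all hypothetical and would need tuning against real traffic.

```python
import math


def desired_replicas(queue_depth: int, avg_service_time_s: float,
                     target_wait_s: float = 5.0,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Replicas needed so queued requests wait about target_wait_s or less."""
    if queue_depth == 0:
        return min_replicas  # allows scale-to-zero when min_replicas is 0
    # Draining the queue with n replicas takes about queue_depth * service_time / n.
    needed = math.ceil(queue_depth * avg_service_time_s / target_wait_s)
    return max(min_replicas, min(max_replicas, max(needed, 1)))


print(desired_replicas(queue_depth=10, avg_service_time_s=3.0))
```

With ten queued requests at three seconds each and a five-second target, the heuristic asks for six replicas. A production autoscaler would add smoothing and cooldown periods, since GPU-backed containers are slow to start.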
### 3. Caching Layer
- Cache repeated queries to reduce compute requirements
- Redis or Memcached for high-traffic scenarios
- Typical cache hit rate: 30-45% for enterprise workloads
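The caching idea can be sketched in a few lines of Python. This in-memory LRU version stands in for Redis or Memcached to keep the example self-contained; the class name `CompletionCache` and its methods are assumptions made for illustration, and exact-match caching is only appropriate when identical requests should return identical answers (for example, deterministic sampling settings).

```python
import hashlib
import json
from collections import OrderedDict
from typing import Callable


class CompletionCache:
    """Exact-match LRU cache keyed by model, prompt, and sampling parameters."""

    def __init__(self, max_entries: int = 1024):
        self._store: "OrderedDict[str, str]" = OrderedDict()
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        blob = json.dumps({"model": model, "prompt": prompt, "params": params},
                          sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[], str], **params) -> str:
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = compute()  # only call the model on a miss
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return result

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` alongside the cache makes it easy to check whether a given workload actually benefits before investing in a shared cache tier.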
### 4. Hardware Optimization
- GPU sharing: Multiple models on same GPU (e.g., 2 smaller + 1 larger)
- Model sharding: Distribution across multiple GPUs for larger models
- Mixed-precision: BF16/FP16 inference for up to 2x speed over FP32 (minor accuracy trade-off)
## Next Steps
**For CTOs and Technology Leaders:**
1. **Assess Readiness**: Audit current AI spend, data sensitivity, team capabilities
2. **Proof of Concept**: Deploy Llama 3 in staging environment (1-2 week effort)
3. **Cost-Benefit Analysis**: Calculate 12-month ROI based on projected usage
4. **Skills Development**: Train DevOps team on Docker, Traefik, GPU management
**For Consultants Implementing This Solution:**
1. **Architecture Review**: Design scalable infrastructure for client's specific needs
2. **Pilot Deployment**: Start with 1-2 models, validate performance
3. **Operational Handover**: Document all processes, provide training
4. **Ongoing Optimization**: Regular reviews, model updates, capacity planning
---
> Expert Quote: *Self-hosted AI infrastructure reduces enterprise AI costs by 60-80% while providing complete data sovereignty. The initial 4-6 week implementation delivers immediate value with long-term flexibility.* — Industry Analyst Report, 2025