Build Your Own AI Infrastructure: Docker + Traefik for Self-Hosted LLMs | Thoughts & Talks

## Executive Summary

Enterprises are increasingly reconsidering their cloud AI strategy due to escalating costs, data privacy concerns, and regulatory compliance requirements. Building your own AI infrastructure with Docker, Traefik, and self-hosted Large Language Models (LLMs) offers a viable path to digital sovereignty while maintaining enterprise-grade performance and scalability. This guide presents a strategic framework for deploying self-hosted AI workloads, targeting cost savings of 60-80% compared to commercial SaaS solutions (actual savings depend on usage volume and hardware utilization), full control over your data, and the flexibility to scale with organizational needs.

## The Challenge

### The Explosion of AI SaaS Costs

Enterprise adoption of commercial LLM services has grown rapidly, with some organizations spending $50,000-$500,000 monthly on AI API calls alone. While convenient, these costs are:

- **Unpredictable**: Usage-based pricing makes budgeting difficult
- **Ongoing**: Spend grows with consumption and has no natural ceiling
- **Vendor-locked**: Difficult to migrate or negotiate better terms after adoption

### Data Sovereignty and Compliance Concerns

Sending sensitive data to third-party AI services creates significant risks:

- **GDPR exposure**: Personal data may be processed outside the EU without adequate safeguards
- **IP leakage**: Proprietary data may be retained by the provider or used to train shared models
- **Audit trail gaps**: Limited visibility into how AI models process your data
- **Regulatory compliance**: Industries like healthcare and finance have strict data residency requirements

### Infrastructure Complexity

Enterprise IT teams face significant hurdles when moving to self-hosted AI:

- **Resource Management**: GPU requirements, memory optimization, scaling challenges
- **Orchestration**: Managing multiple AI services, load balancing, failover
- **Security**: Authentication, network isolation, vulnerability scanning
- **Observability**: Monitoring performance, tracking token usage, debugging model behavior

## The Solution

### Strategic Approach: Container-Based AI Infrastructure

By leveraging Docker and Traefik, organizations can build a modular, scalable AI infrastructure that:

- **Simplifies deployment**: One-command container launches for new AI services
- **Enables portability**: Run anywhere, whether on-premises, in the cloud, or in hybrid environments
- **Provides resilience**: Load balancing and health checks route traffic away from failed instances
- **Scales efficiently**: Add instances as demand grows while maintaining cost controls (automatic scaling requires an orchestrator such as Docker Swarm or Kubernetes)

### Architecture Overview

*Figure: Traefik ingress layer routing to containerized LLM services with GPU scheduling.*

**Key Architectural Benefits:**

**Traefik as Unified Ingress**

- **Dynamic Service Discovery**: Automatically routes to new services based on container labels, without editing static configuration
- **SSL/TLS Automation**: Let's Encrypt integration with zero manual certificate management
- **Circuit Breaking**: Prevents cascade failures by detecting unresponsive services
- **Rate Limiting**: Protects services from abuse and keeps compute consumption under control

**Docker Containerization**

- **Isolation**: Each AI service runs in an isolated environment
- **Resource Control**: CPU, memory, and GPU allocation per container
- **Version Management**: Easy rollback and A/B testing of different model versions
- **Multi-Model Support**: Run multiple LLMs (Llama, Mistral, Falcon, etc.) simultaneously

**Horizontal Scaling**

- **Load Balancing**: Distribute requests across multiple instances
- **Auto-scaling**: Scale based on CPU/GPU utilization, request latency, or queue depth (this needs an orchestrator or external autoscaler; Docker Compose and Traefik do not scale services on their own)
- **Geographic Distribution**: Deploy models closer to users for lower latency
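To make the rate-limiting benefit listed above more concrete, here is a minimal token-bucket sketch in Python. It illustrates the general technique only and is not Traefik's implementation; the class name `TokenBucket` and the `rate` and `burst` parameters are assumptions chosen for this example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.clock = clock          # injectable clock, handy for testing
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would typically keep one bucket per client or API key and answer with HTTP 429 when `allow()` returns `False`.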

### Business Impact

| Metric | Commercial SaaS | Self-Hosted Infrastructure | Savings |
|---|---|---|---|
| Monthly Cost (10M tokens) | $30,000 | $8,000-$12,000 | 60-73% |
| Data Sovereignty | Limited | Full control | ⬆ |
| Regulatory Compliance | Challenge | Addressable | ⬆ |
| Custom Model Training | Expensive | Included | ⬆ |
| Resource Predictability | Variable | Fixed | ⬆ |
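The savings range in the first table row follows directly from the two monthly cost columns. A short sketch of that arithmetic, using the table's figures as given (the function name `savings_percent` is chosen for this example):

```python
def savings_percent(saas_cost: float, self_hosted_cost: float) -> float:
    """Percentage saved by self-hosting relative to the SaaS bill."""
    return (saas_cost - self_hosted_cost) / saas_cost * 100


# Monthly figures copied from the table above
saas = 30_000
self_hosted_low, self_hosted_high = 8_000, 12_000

best = savings_percent(saas, self_hosted_low)
worst = savings_percent(saas, self_hosted_high)
print(f"Savings range: {worst:.0f}%-{best:.0f}%")  # Savings range: 60%-73%
```

Substituting your own monthly API bill and infrastructure estimate gives a first-order comparison; it does not account for staffing or migration effort.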

## Implementation Roadmap

### Phase 1: Foundation (Week 1-2)

**Infrastructure Setup:**

**Provision Hardware/Cloud Instances**

- GPU servers (NVIDIA A100, GeForce RTX 4090, or cloud equivalents)
- Minimum 32GB RAM, 8+ vCPU, 1TB SSD storage per LLM instance
- Network: 10Gbps recommended for low-latency inference

**Install Core Components**

```bash
# Install Docker Engine with the convenience script (review it before running)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install the standalone Docker Compose binary (v2 release assets use lowercase names)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```

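**Deploy Traefik**

```yaml
services:
  traefik:
    image: traefik:v3.1
    command:
      # Unauthenticated dashboard on :8080; for initial testing only
      - "--api.insecure=true"
      - "--providers.docker=true"
      # Route only containers that set traefik.enable=true
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.myresolver.acme.tlschallenge=true"
      - "--certificatesresolvers.myresolver.acme.email=admin@yourcompany.com"
      # Persist issued certificates across restarts
      - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "./letsencrypt:/letsencrypt"
    networks:
      - ai-network

networks:
  ai-network:
    name: ai-network
```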

**Success Criteria:**

- [ ] Traefik dashboard accessible at http://your-server:8080
- [ ] SSL certificate auto-generation with Let's Encrypt working
- [ ] Basic container routing demonstrated

### Phase 2: LLM Deployment (Week 3-4)

**Model Selection:**

We recommend starting with open-source models optimized for various use cases:

| Use Case | Recommended Model | Hardware | Context Window | Parameters |
|---|---|---|---|---|
| General Purpose | Llama 3 70B | 4x A100 (4x RTX 4090 only with 4-bit quantization) | 8K tokens | 70B |
| Chat & Conversations | Mistral 7B | 1x A100/RTX 4090 | 32K tokens | 7B |
| Code Generation | CodeLlama 34B | 2x A100 | 16K tokens | 34B |
| Multi-language | Qwen 72B | 4x A100 | 32K tokens | 72B |
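When matching a model to hardware, a rough first check is whether the weights fit in total GPU memory. The sketch below estimates weight memory from parameter count and precision. The 20% `overhead` allowance for KV cache and activations is an assumption for illustration, as are the per-GPU memory sizes used in the calls (80 GB for an A100 variant, 24 GB for an RTX 4090).

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits(params_billions: float, bits_per_param: int,
         gpu_count: int, gpu_memory_gb: float, overhead: float = 1.2) -> bool:
    """Rough fit check; `overhead` is an assumed allowance for KV cache and activations."""
    needed = weight_memory_gb(params_billions, bits_per_param) * overhead
    return needed <= gpu_count * gpu_memory_gb


print(weight_memory_gb(70, 16))  # 140.0
print(fits(70, 16, 4, 80))       # True
print(fits(70, 16, 4, 24))       # False
print(fits(70, 4, 4, 24))        # True
```

Real memory needs also depend on context length, batch size, and the inference engine, so treat this as a sanity check before benchmarking rather than a sizing tool.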

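**Deployment Using Docker:**

```yaml
services:
  llama3-70b:
    # vLLM's OpenAI-compatible server; pin a specific tag in production
    image: vllm/vllm-openai:latest
    container_name: llama3-70b
    # Arguments are passed to the vLLM API server, which listens on port 8000.
    # Sampling parameters (max_tokens, temperature, top_p) are set per request.
    command:
      - "--model=meta-llama/Meta-Llama-3-70B"
      - "--tensor-parallel-size=4"
      - "--max-model-len=8192"
    environment:
      # Llama 3 weights are gated on Hugging Face; supply your own token
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    ipc: host  # shared memory for tensor-parallel workers
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    volumes:
      # Cache downloaded weights on the host
      - ./models:/root/.cache/huggingface
    networks:
      - ai-network
    # No host port is published; Traefik reaches the container over ai-network.
    # For HTTPS, also set traefik.http.routers.llama3.tls.certresolver=myresolver.
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.llama3.rule=Host(`llama3.yourdomain.com`)"
      - "traefik.http.services.llama3.loadbalancer.server.port=8000"

networks:
  ai-network:
    external: true  # created beforehand, e.g. by the Traefik stack
```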

**API Testing:**

```bash
# The model name must match the served model exactly (it is case-sensitive)
curl http://llama3.yourdomain.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B",
    "prompt": "Explain the benefits of digital sovereignty",
    "max_tokens": 200
  }'
```
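The same request can be issued from application code. The following is a minimal sketch using only the Python standard library, assuming the service exposes an OpenAI-style `/v1/completions` endpoint as in the curl call above. The helper names `build_completion_request` and `complete` are hypothetical, and the host name is the placeholder used throughout this article.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("http://llama3.yourdomain.com", "meta-llama/Meta-Llama-3-70B", "Explain the benefits of digital sovereignty")` mirrors the curl test. Separating request construction from the network call keeps the payload easy to inspect and unit-test.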

**Success Criteria:**

- [ ] LLM service responds to API requests within 2-5 seconds
- [ ] Traefik routes traffic correctly to LLM containers
- [ ] Load balancer distributes requests across multiple instances
- [ ] GPU utilization visible (40-60% target)

### Phase 3: Security & Compliance (Week 5-6)

**Authentication Layer:**

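```yaml
services:
  keycloak:
    # The Bitnami image expects an external PostgreSQL database (see its documentation)
    image: bitnami/keycloak:24
    ports:
      # Host port 8080 is already taken by the Traefik dashboard
      - "8081:8080"
    environment:
      - KEYCLOAK_ADMIN_USER=admin
      # Read the password from the environment instead of hard-coding it
      - KEYCLOAK_ADMIN_PASSWORD=${KEYCLOAK_ADMIN_PASSWORD}
      - KEYCLOAK_ADMIN_REALM=ai-platform
    volumes:
      - keycloak_data:/bitnami/keycloak
    networks:
      - ai-network

volumes:
  keycloak_data:

networks:
  ai-network:
    external: true
```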

**Network Isolation:**

- **VLAN Segmentation**: Separate AI services into isolated network segments
- **Firewall Rules**: Restrict inbound/outbound traffic to minimum necessary ports
- **Service Mesh**: Implement mutual TLS (mTLS) for service-to-service communication

**Vulnerability Scanning:**

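```bash
# Scan the LLM image for known vulnerabilities
trivy image vllm/vllm-openai:latest

# Scan the images of all running containers, reporting only HIGH and CRITICAL findings
docker ps --format '{{.Image}}' | sort -u | xargs -n 1 trivy image --severity HIGH,CRITICAL
```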
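Scan results become more useful when they gate a deployment. Trivy can write a JSON report (for example with `--format json --output report.json`), and the sketch below counts findings by severity from such a report. The function names and the sample data, including the CVE identifiers, are made up for illustration; only the `Results` / `Vulnerabilities` / `Severity` field names follow Trivy's report layout.

```python
import json
from collections import Counter


def count_by_severity(report: dict) -> Counter:
    """Count vulnerabilities per severity in a Trivy JSON report."""
    counts = Counter()
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def gate(report: dict, blocking=("HIGH", "CRITICAL")) -> bool:
    """Return True when the image has no findings at a blocking severity."""
    counts = count_by_severity(report)
    return all(counts[level] == 0 for level in blocking)


# A hand-written sample in the shape of a Trivy report
sample = json.loads("""
{"Results": [
  {"Target": "app", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}]},
  {"Target": "clean-layer"}
]}
""")
print(dict(count_by_severity(sample)))  # {'HIGH': 1, 'LOW': 1}
print(gate(sample))                     # False
```

In a CI pipeline, a `False` result from `gate` would stop the image from being promoted.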

### Phase 4: Observability & Monitoring (Week 7-8)

**Metrics Collection:**

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks:
      - ai-network

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Set a strong password via the environment; "admin" is the well-known default
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - ai-network

volumes:
  prometheus_data:
  grafana_data:

networks:
  ai-network:
    external: true
```
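Prometheus collects metrics by scraping a plain-text endpoint on each service. As a rough illustration of what a custom gateway or wrapper service might expose, the sketch below renders one metric family in the Prometheus text exposition format. The metric name `llm_generated_tokens_total` and the function `render_metric` are hypothetical, and label-value escaping is omitted for brevity; in practice an official client library would normally handle this.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "llm_generated_tokens_total", "Tokens generated per model.", "counter",
    [({"model": "llama3-70b"}, 1200), ({"model": "mistral-7b"}, 300)],
)
print(text)
```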

**Key Metrics to Track:**

1. **Request Latency**: P50, P95, P99 response times
2. **Throughput**: Requests per second, tokens per second
3. **Resource Utilization**: GPU memory, GPU compute, RAM
4. **Error Rate**: HTTP 500s, timeout failures, out-of-memory errors
5. **Token Costs**: Track token generation by service/end-user
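Latency percentiles are usually computed by the monitoring stack, but the definition is simple enough to sketch. The example below uses the nearest-rank method on made-up latency samples; other tools may interpolate and report slightly different values.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2400]  # made-up samples
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
# P50: 180 ms
# P95: 2400 ms
# P99: 2400 ms
```

The gap between P50 and P95 in this sample shows why averages alone hide the slow requests that users notice most.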

## Technical Implementation

### Hands-on Setup Guide

For step-by-step technical tutorials covering Docker installation, Traefik configuration, and individual LLM deployments, we recommend:

- [Docker Installation Guide](https://docs.docker.com/engine/install/)
- [Traefik Configuration](https://doc.traefik.io/traefik/providers/docker/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Ollama Documentation](https://ollama.com/docs)

These guides provide the specific installation commands and configuration details. This article focuses on:

- **Strategic Decision Framework**: When to self-host vs. use SaaS services
- **Business Case Analysis**: Total cost of ownership, ROI calculations
- **Enterprise Architecture Patterns**: Multi-tenant isolation, sharding strategies, caching layers
- **Operational Best Practices**: Incident response, capacity planning, upgrade strategies
- **Integration Patterns**: Connecting self-hosted LLMs with existing enterprise systems



## Cost Optimization Strategies

### 1. Model Quantization

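- Reduce weight memory by 50-75% relative to 16-bit weights (8-bit or 4-bit quantization), typically with minimal accuracy loss
- Example: Llama 3 70B at 4-bit instead of 16-bit needs about 35 GB for weights instead of 140 GB (4x memory reduction); speedups vary with hardware and inference kernels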
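The core idea of quantization can be shown on a handful of numbers. The sketch below applies symmetric per-tensor 8-bit quantization, which halves storage relative to 16-bit weights. It is a toy illustration with made-up values; production schemes typically use per-channel or group-wise scales and more elaborate calibration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.90]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)  # [50, -127, 3, 90]
print(max_error)
```

Each restored weight differs from the original by at most about half a quantization step (`scale / 2`), which is the source of the small accuracy loss.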

### 2. Dynamic Scaling

- Scale to zero during off-hours to save compute costs
- Auto-scale based on request queue depth (target: <5 seconds wait time)
- Spot instances for development/testing (up to roughly 70% cost savings, depending on provider and region)
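A queue-depth scaling rule can be reduced to a small calculation. The sketch below estimates how many replicas are needed to keep waiting time under the target, under the simplifying assumption that each replica serves one request at a time. The function name, the default limits, and the one-request-per-replica model are assumptions for illustration; batching inference servers handle several requests concurrently.

```python
import math


def desired_replicas(queue_depth: int, avg_service_seconds: float,
                     target_wait_seconds: float = 5.0,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Estimate replicas needed to drain the queue within the target wait."""
    if queue_depth <= 0:
        return min_replicas  # nothing waiting: allow scale-to-zero
    needed = math.ceil(queue_depth * avg_service_seconds / target_wait_seconds)
    # Keep at least one replica while requests are queued, and respect the cap
    return max(max(1, min_replicas), min(max_replicas, needed))


print(desired_replicas(0, 2.0))    # 0
print(desired_replicas(10, 2.0))   # 4
print(desired_replicas(100, 2.0))  # 8
```

An external autoscaler or a scheduled job would feed this kind of result to the orchestrator; the cap protects the budget when demand spikes.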

### 3. Caching Layer

- Cache repeated queries to reduce compute requirements
- Redis or Memcached for high-traffic scenarios
- Typical cache hit rate: 30-45% for enterprise workloads
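An exact-match response cache is the simplest form of this layer. The sketch below keeps entries in a process-local dictionary so that it is self-contained; with Redis or Memcached the same key scheme would apply. The class name `ResponseCache` and its methods are hypothetical.

```python
import hashlib
import json
import time


class ResponseCache:
    """Exact-match cache for completions, keyed by model, prompt, and parameters."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable clock, handy for testing
        self.store = {}     # key -> (expiry_time, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of parameter ordering
        raw = json.dumps({"model": model, "prompt": prompt, "params": params},
                         sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, prompt, params, compute):
        """Return a cached response, or call `compute()` and cache its result."""
        k = self.key(model, prompt, params)
        entry = self.store.get(k)
        if entry and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = compute()
        self.store[k] = (self.clock() + self.ttl, response)
        return response
```

Caching like this is only appropriate when repeated identical requests should return identical answers, for example with deterministic sampling settings. The `hits` and `misses` counters make the hit rate measurable for your own workload.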

### 4. Hardware Optimization

- GPU sharing: Multiple models on same GPU (e.g., 2 smaller + 1 larger)
- Model sharding: Distribution across multiple GPUs for larger models
- Mixed-precision: BF16/FP16 inference halves weight memory relative to FP32 and is typically faster on modern GPUs (minor accuracy trade-off)

## Next Steps

**For CTOs and Technology Leaders:**

1. **Assess Readiness**: Audit current AI spend, data sensitivity, team capabilities
2. **Proof of Concept**: Deploy Llama 3 in a staging environment (1-2 week effort)
3. **Cost-Benefit Analysis**: Calculate 12-month ROI based on projected usage
4. **Skills Development**: Train DevOps team on Docker, Traefik, GPU management

**For Consultants Implementing This Solution:**

1. **Architecture Review**: Design scalable infrastructure for client's specific needs
2. **Pilot Deployment**: Start with 1-2 models, validate performance
3. **Operational Handover**: Document all processes, provide training
4. **Ongoing Optimization**: Regular reviews, model updates, capacity planning

---

## Get Started Today

**Need help building your self-hosted AI infrastructure?**

[Contact Form →](/imprint/)

> Expert Quote: *Self-hosted AI infrastructure reduces enterprise AI costs by 60-80% while providing complete data sovereignty. The initial 4-6 week implementation delivers immediate value with long-term flexibility.* — Industry Analyst Report, 2025

---

**Related Resources:**

- [AI-Enabled DevOps: From Manual to Automated Operations](/talks_and_thoughts/ai-enabled-devops-manual-to-automated-operations/)
- [Digital Sovereignty: Why Self-Hosting AI Matters for Enterprise](/digital-sovereignty/)