Building a Production Observability Stack with Grafana, Prometheus, Loki, and Alloy
Why Build Your Own Stack
Managed observability services (Datadog, Grafana Cloud, New Relic) are powerful but expensive. At a scale of 10+ hosts, a self-hosted Grafana + Prometheus + Loki stack can save roughly $500-2000 per month and gives you full control over retention, alerting, and data governance.
The stack presented here runs on a single VM with Docker Compose, and the same components move to Kubernetes without changing the architecture.
Architecture

Core Components
Grafana — The Pane of Glass
Grafana serves as the unified dashboard layer. With auto-provisioned data sources, connecting Prometheus (metrics) and Loki (logs) takes zero manual configuration after first deploy.
Key configuration decisions:
- Provisioning via YAML: Dashboards, data sources, and alert contact points are all defined as files mounted into the container. This makes them version-controllable and reproducible, with no click-ops in the UI.
- Anonymous access on private networks: For internal deployments, enabling anonymous access with viewer permissions removes the login friction. For internet-facing setups, enable OAuth via GitHub, Google, or your IdP.
- Pre-built dashboards: Ship the Node Exporter Full dashboard and Docker Containers dashboard as JSON — the two most-requested views for infrastructure monitoring.
Prometheus — Metrics Collection
Prometheus scrapes targets on a pull model. The configuration decision that matters most is scrape interval vs. storage cost:
| Scrape Interval | Daily Storage (per host) | Retention | Use Case |
|---|---|---|---|
| 15s | ~150 MB | 30 days | Critical infra |
| 30s | ~75 MB | 14 days | Standard |
| 60s | ~35 MB | 7 days | Non-critical |
The stack defaults to 15s for core targets (node_exporter, cAdvisor) and 30s for application endpoints. This gives good granularity without blowing up storage.
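If Alloy does the scraping (as in the collector section later in this article), the two default intervals could be expressed roughly as follows. This is a sketch under assumptions, not the stack's actual configuration: the addresses `node-exporter:9100`, `cadvisor:8080`, and `app:8080` are assumed Compose service names, and the fragment relies on the `prometheus.remote_write "default"` component defined in the Alloy section.

```alloy
// Core infrastructure targets: 15s for finer granularity.
// Addresses are hypothetical Compose service names.
prometheus.scrape "core" {
  targets = [
    {"__address__" = "node-exporter:9100"},
    {"__address__" = "cadvisor:8080"},
  ]
  scrape_interval = "15s"
  forward_to      = [prometheus.remote_write.default.receiver]
}

// Application endpoints: 30s, which roughly halves the sample volume.
prometheus.scrape "apps" {
  targets         = [{"__address__" = "app:8080"}]
  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.default.receiver]
}
```

Splitting targets into separate scrape components keeps the interval decision visible per group, so a noisy application can be moved to a slower interval without touching the core targets.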
Loki — Log Aggregation
Unlike Elasticsearch, Loki does not index log content — it indexes only metadata labels. This makes it dramatically cheaper to operate:
- No full-text indexing: Storage is ~10-20% of Elasticsearch for the same log volume
- Label-based querying: Log streams are identified by labels like job, instance, container_name
- LogQL: Powerful query language that's closer to PromQL than Lucene
The key configuration parameter is retention_period, set under limits_config. It only takes effect when the compactor runs with retention_enabled: true. The stack defaults to 14 days, after which log data is automatically pruned.
Alloy — The Collector Layer
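Grafana Alloy replaces the earlier Grafana Agent and Promtail collectors with a single binary that handles metrics, logs, and traces (traces are forwarded to a backend such as Tempo, which Alloy does not replace). Its configuration uses its own DSL, Alloy syntax (.alloy files), which evolved from Grafana Agent's River language: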
// Discovery: find all Docker containers
discovery.docker "local" {
  host = "unix:///var/run/docker.sock"
}

// Scrape metrics from discovered containers
prometheus.scrape "default" {
  targets    = discovery.docker.local.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

// Push scraped samples to Prometheus, which must be started with
// --web.enable-remote-write-receiver to accept remote-write requests
prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
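The snippet above covers container metrics only. With Alloy's built-in components (for example prometheus.exporter.unix, prometheus.exporter.cadvisor, and loki.source.docker), the same binary can replace the old pattern of running node_exporter + promtail + cadvisor as separate processes.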
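For the log side of the pipeline, a minimal sketch might look like the following. It reuses the `discovery.docker "local"` component from the snippet above. The `container_name` label matches the label naming used in the Loki section, and the `loki:3100` address is an assumed Compose service name rather than something this article specifies.

```alloy
// Turn Docker's container name metadata (which has a leading slash)
// into a container_name label on each log stream.
discovery.relabel "docker_logs" {
  targets = discovery.docker.local.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container_name"
  }
}

// Tail logs from the discovered containers via the Docker socket.
loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.docker_logs.output
  forward_to = [loki.write.default.receiver]
}

// Push log entries to Loki (assumed to be reachable as "loki" on port 3100).
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

Keeping the label set small (here a single extra label) matters because Loki indexes only labels, so each distinct label value creates a separate stream.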
Production Considerations
Retention Tuning
The biggest operational cost in an observability stack is storage. Set retention based on your actual needs:
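# Prometheus: 30 days at full resolution (--storage.tsdb.retention.time=30d)
# Prometheus, long term: 90 days for downsampled data; Prometheus does not downsample on its own, so this needs an external tool such as Thanos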
# Loki: 14 days of raw logs, archive older to S3
# Grafana: snapshot dashboards periodically, don't rely on browser cache
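As a rough sizing sketch using only figures quoted in this article (about 150 MB per host per day at a 15s scrape interval, or 75 MB at 30s, and the ~50-host ceiling from the Scaling section), Prometheus disk usage can be estimated as below. Actual usage depends on the number of active series per host, so treat these as order-of-magnitude numbers rather than measurements.

$$
\text{disk} \approx \text{daily storage per host} \times \text{hosts} \times \text{retention days}
$$

$$
150\ \text{MB} \times 50 \times 30 = 225{,}000\ \text{MB} \approx 225\ \text{GB}
\qquad
75\ \text{MB} \times 50 \times 14 = 52{,}500\ \text{MB} \approx 52.5\ \text{GB}
$$

The gap between the two results shows why scrape interval and retention are worth deciding together: halving the sample rate and the retention window cuts the disk requirement by roughly a factor of four.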
Alerting
Grafana's built-in alerting can take the place of a standalone Alertmanager in this stack. Alert rules evaluate Prometheus queries and route notifications through contact points:
- Critical → PagerDuty or Slack webhook with @here
- Warning → Email or Slack channel
- Info → Dashboard annotations only
Scaling
This stack runs on a single 2 vCPU / 4 GB RAM VM for workloads up to ~50 hosts. Beyond that:
- Add a second Prometheus in remote-write mode
- Shard Loki into multiple ingesters
- Move Alloy to each target host instead of centralized collection
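The last step, running Alloy on each target host, could be sketched as follows. It uses Alloy's embedded node_exporter (`prometheus.exporter.unix`) so no separate exporter process is needed. The hostname `prometheus.internal.example` is a placeholder for wherever the central Prometheus is reachable, and that Prometheus is assumed to accept remote-write requests.

```alloy
// Embedded node_exporter: exposes host-level metrics from inside Alloy.
prometheus.exporter.unix "host" { }

// Scrape the embedded exporter locally on this host.
prometheus.scrape "host" {
  targets         = prometheus.exporter.unix.host.targets
  scrape_interval = "15s"
  forward_to      = [prometheus.remote_write.central.receiver]
}

// Ship samples to the central Prometheus (placeholder hostname).
prometheus.remote_write "central" {
  endpoint {
    url = "http://prometheus.internal.example:9090/api/v1/write"
  }
}
```

With this layout the central server no longer needs network access to every host's metrics port, since each host pushes its own data.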
Key Takeaways
- A self-hosted observability stack saves $500-2000/month at 10+ host scale compared to managed services
- Grafana + Prometheus + Loki is a widely adopted open-source combination, and the Prometheus exposition and remote-write formats are broadly supported across platforms
- Alloy simplifies the collector layer to a single config file
- Start with the Compose stack, graduate to Kubernetes when you outgrow it