Building a Production Observability Stack with Grafana, Prometheus, Loki, and Alloy

Why Build Your Own Stack

Managed observability services (Datadog, Grafana Cloud, New Relic) are powerful but expensive. At scale, a self-hosted Grafana + Prometheus + Loki stack can save hundreds to thousands of dollars per month and gives you full control over retention, alerting, and data governance.

The stack presented here runs on a single VM with Docker Compose, and the same components carry over to Kubernetes when you need to scale horizontally.

Architecture

Core Components

Grafana — The Pane of Glass

Grafana serves as the unified dashboard layer. With auto-provisioned data sources, connecting Prometheus (metrics) and Loki (logs) takes zero manual configuration after first deploy.

Key configuration decisions:

Provisioning via YAML: Dashboards, data sources, and alert contact points are all defined as files mounted into the container. This makes them version-controllable and reproducible, with no click-ops in the UI.
Anonymous access on private networks: For internal deployments, enabling anonymous access with the Viewer role removes login friction. For internet-facing setups, enable OAuth via GitHub, Google, or your IdP.
Pre-built dashboards: Ship the Node Exporter Full dashboard and Docker Containers dashboard as JSON — the two most-requested views for infrastructure monitoring.
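
As a minimal sketch of the file-based provisioning described above, a data source file might look like the following. The hostnames prometheus and loki and the port 3100 are assumptions (Compose service names and Loki's default HTTP port), and the file path is hypothetical; adjust them to your deployment.

```yaml
# grafana/provisioning/datasources/datasources.yaml (hypothetical path in the repo)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy              # Grafana's backend proxies the queries
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100      # assumed: Loki's default HTTP port
```

Mounted under /etc/grafana/provisioning/datasources/ in the container, the file is read when Grafana starts. Anonymous viewer access is typically switched on with the GF_AUTH_ANONYMOUS_ENABLED=true and GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer environment variables.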

Prometheus — Metrics Collection

Prometheus scrapes targets on a pull model. The configuration decision that matters most is scrape interval vs. storage cost:

Scrape Interval	Daily Storage (per host)	Retention	Use Case
15s	~150 MB	30 days	Critical infra
30s	~75 MB	14 days	Standard
60s	~35 MB	7 days	Non-critical

The stack defaults to 15s for core targets (node_exporter, cAdvisor) and 30s for application endpoints. This gives good granularity without blowing up storage.
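
Taking the table's figures at face value, a critical host at 15s costs roughly 150 MB × 30 days ≈ 4.5 GB of disk, while a standard host at 30s costs about 75 MB × 14 days ≈ 1 GB. A sketch of how the two default intervals could be expressed in prometheus.yml follows; the target hostnames and the app job are assumptions for illustration, not part of the original stack definition.

```yaml
# prometheus.yml (fragment). Target hostnames are assumed Compose service names.
global:
  scrape_interval: 30s          # default, inherited by application endpoints

scrape_configs:
  - job_name: node_exporter
    scrape_interval: 15s        # per-job override for core infrastructure
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: cadvisor
    scrape_interval: 15s
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: app               # hypothetical application job, uses the 30s default
    static_configs:
      - targets: ["app:8080"]
```

Setting the slower interval globally and overriding it per job keeps any new job on the cheaper setting unless it opts in to 15s.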

Loki — Log Aggregation

Unlike Elasticsearch, Loki does not index log content — it indexes only metadata labels. This makes it dramatically cheaper to operate:

No full-text indexing: Storage is ~10-20% of Elasticsearch for the same log volume
Label-based querying: Log streams are identified by labels like job, instance, container_name
LogQL: Powerful query language that's closer to PromQL than Lucene

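The key configuration parameter is retention_period, which the stack sets to 14 days. Loki only enforces it when the compactor runs with retention enabled; once that is in place, log data older than the retention period is pruned automatically.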
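
A minimal sketch of the retention settings, assuming a recent Loki 3.x release with filesystem storage. Key names have changed between Loki versions, and the working directory is a hypothetical path, so check both against the version you run.

```yaml
# loki-config.yaml (fragment)
limits_config:
  retention_period: 336h              # 14 days

compactor:
  working_directory: /loki/compactor  # hypothetical path inside the container
  retention_enabled: true             # without this, retention_period is not enforced
  delete_request_store: filesystem    # assumed: matches a filesystem object store
```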

Alloy — The Collector Layer

Grafana Alloy replaces the earlier Grafana Agent and Promtail collectors with a single binary that handles metrics, logs, and traces. Its configuration uses its own DSL, the Alloy configuration syntax (.alloy files):

// Discovery: find all Docker containers
discovery.docker "local" {
  host = "unix:///var/run/docker.sock"
}

// Scrape metrics from discovered containers
prometheus.scrape "default" {
  targets    = discovery.docker.local.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

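// Push scraped samples to Prometheus. The receiving Prometheus must be started
// with --web.enable-remote-write-receiver; otherwise /api/v1/write is disabled.
prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}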

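With the matching components added (for example prometheus.exporter.unix, prometheus.exporter.cadvisor, and loki.source.docker), this replaces the old pattern of running node_exporter + promtail + cadvisor as separate processes.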
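
To run a configuration like the one above under Docker Compose, a service definition along these lines should work. It is a sketch: the image tag, the host paths, and the config file name are assumptions.

```yaml
# docker-compose.yml (fragment); image tag and host paths are assumptions
services:
  alloy:
    image: grafana/alloy:latest
    command:
      - run
      - /etc/alloy/config.alloy
      - --storage.path=/var/lib/alloy/data
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy:ro
      # discovery.docker reads container metadata from the Docker socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
```

Mounting the Docker socket read-only still exposes container metadata to Alloy, so treat the collector as a privileged component.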

Production Considerations

Retention Tuning

The biggest operational cost in an observability stack is storage. Set retention based on your actual needs:

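# Prometheus: 30 days for high-res, 90 days for downsampled (Prometheus does not downsample on its own; this needs recording rules or a long-term store such as Thanos)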
# Loki: 14 days of raw logs, archive older to S3
# Grafana: snapshot dashboards periodically, don't rely on browser cache
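
Prometheus retention is set with command-line flags rather than in prometheus.yml. The Compose fragment below is a sketch of the 30-day setting; the size cap, image tag, and volume names are assumptions added for illustration.

```yaml
# docker-compose.yml (fragment)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      # Overriding command drops the image defaults, so restate them first
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=20GB   # optional cap; value is an assumption
      # Needed because the Alloy config pushes to /api/v1/write
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus

volumes:
  prometheus-data:
```

When both a time and a size limit are set, Prometheus removes data as soon as either limit is reached.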

Alerting

Grafana's built-in alerting can replace a standalone Alertmanager in a stack of this size. Alert rules evaluate Prometheus (or Loki) queries and route to contact points by severity:

Critical → PagerDuty or Slack webhook with @here
Warning → Email or Slack channel
Info → Dashboard annotations only
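
The routing above can be sketched with Grafana's alerting provisioning files. This assumes alert rules carry a severity label; the contact point names, webhook URL, and email address are placeholders. PagerDuty and the info tier are left out to keep the example short.

```yaml
# grafana/provisioning/alerting/routing.yaml (hypothetical path)
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack-critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/REPLACE_ME   # placeholder webhook
  - orgId: 1
    name: email-warning
    receivers:
      - uid: email-warning
        type: email
        settings:
          addresses: ops@example.com                          # placeholder address

policies:
  - orgId: 1
    receiver: email-warning            # root receiver for unmatched alerts
    group_by: ["alertname"]
    routes:
      - receiver: slack-critical
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: email-warning
        object_matchers:
          - ["severity", "=", "warning"]
```

Alerts that match neither route fall through to the root receiver, so info-level rules need their own handling if they should never notify anyone.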

Scaling

This stack runs on a single 2 vCPU / 4 GB RAM VM for workloads up to ~50 hosts. Beyond that:

Add a second Prometheus in remote-write mode
Shard Loki into multiple ingesters
Move Alloy to each target host instead of centralized collection
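
One reading of "remote-write mode" is a second Prometheus that scrapes its own share of hosts and forwards samples to a central instance. Under that assumption, the additional instance would carry a fragment like the one below; the hostname prometheus-central is hypothetical.

```yaml
# prometheus.yml on the additional instance (fragment)
remote_write:
  - url: http://prometheus-central:9090/api/v1/write   # hypothetical central instance
```

The central instance has to be started with --web.enable-remote-write-receiver to accept these writes.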

Key Takeaways

A self-hosted observability stack saves $500-2000/month at 10+ host scale compared to managed services
Grafana + Prometheus + Loki is a de facto standard for open-source observability, and most major platforms support its protocols
Alloy simplifies the collector layer to a single config file
Start with the Compose stack, graduate to Kubernetes when you outgrow it