Grafana for Infrastructure Observability: Dashboards, Alerting and Data Sources
Why Grafana
Grafana has become the de facto standard for infrastructure observability dashboards. Its support for multiple data sources, flexible query languages, and built-in alerting makes it the single pane of glass for modern operations teams.
Architecture
A typical observability stack with Grafana:
┌─────────────┐     ┌──────────┐     ┌──────────────┐
│ Prometheus  │────▶│ Grafana  │◀────│     Loki     │
│  (metrics)  │     │  Server  │     │    (logs)    │
└─────────────┘     └──────────┘     └──────────────┘
       │                 │                   │
       ▼                 ▼                   ▼
 Exporters on       Dashboards +        Log agents
  every host          Alerting          (promtail)
Connecting Data Sources
Start by adding data sources in Configuration → Data Sources (Connections → Data sources in Grafana 10 and later):
Prometheus (metrics):
  URL: http://prometheus:9090
  Scrape interval: 15s
  Default: Yes
Loki (logs):
  URL: http://loki:3100
  Derived fields: extract traceID from log lines for trace correlation
InfluxDB or Graphite can be added alongside for legacy data.
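The same settings can also be written down as data, for example as payloads for Grafana's data source HTTP API or as the basis for provisioning files. The Python sketch below builds such payloads as plain dictionaries without sending anything. The field names (`access`, `isDefault`, `jsonData.timeInterval`, `jsonData.derivedFields`, `matcherRegex`) are believed to match Grafana's data source schema but should be checked against your version; the `traceID=(\w+)` pattern and the helper names are assumptions made for illustration.

```python
import json
import re

# Hypothetical pattern: assumes log lines contain "traceID=<id>".
TRACE_ID_REGEX = r"traceID=(\w+)"


def prometheus_datasource(url="http://prometheus:9090"):
    """Payload for a default Prometheus data source with a 15s scrape interval."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",  # requests go through the Grafana server
        "url": url,
        "isDefault": True,
        "jsonData": {"timeInterval": "15s"},
    }


def loki_datasource(url="http://loki:3100"):
    """Payload for a Loki data source with one derived field for trace IDs."""
    return {
        "name": "Loki",
        "type": "loki",
        "access": "proxy",
        "url": url,
        "isDefault": False,
        "jsonData": {
            "derivedFields": [
                {"name": "traceID", "matcherRegex": TRACE_ID_REGEX}
            ]
        },
    }


def extract_trace_id(log_line):
    """Apply the derived-field regex to a log line; None if nothing matches."""
    match = re.search(TRACE_ID_REGEX, log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(json.dumps([prometheus_datasource(), loki_datasource()], indent=2))
    print(extract_trace_id("level=info msg=done traceID=4bf92f3577b34da6"))
```

Keeping the definitions as data like this makes it straightforward to review them in Git before they reach a Grafana instance.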
Building Effective Dashboards
Dashboard Structure
Organize dashboards by service layer:
/Infrastructure/
├── Nodes Overview — CPU, memory, disk, network across all hosts
├── Kubernetes Cluster — Pod health, resource usage, node status
└── Network Latency — Inter-service latency, packet loss
/Services/
├── Web Server Farm — Request rate, error rate, response times
├── Database Cluster — Connections, query latency, replication lag
└── Message Queue — Queue depth, consumer lag
Panels and Queries
A typical node CPU panel using PromQL:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
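To see what this query computes, the arithmetic can be reproduced by hand. The Python sketch below is a simplified model, not Prometheus itself: it takes two samples of the idle counter per core, derives a per-second rate, averages across cores, and converts idle time into a busy percentage. The sample values are made up, and the real `rate()` also handles counter resets and extrapolation, which this sketch ignores.

```python
def rate(first, last):
    """Per-second increase between two (timestamp, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_utilization_percent(idle_samples_by_cpu):
    """Mirror 100 - avg(rate(idle)) * 100 for a single instance."""
    idle_rates = [rate(first, last) for first, last in idle_samples_by_cpu.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Made-up samples: idle CPU seconds at t=0s and t=300s for two cores.
samples = {
    "cpu0": ((0, 1000.0), (300, 1240.0)),  # 240s idle in 300s, idle fraction 0.8
    "cpu1": ((0, 2000.0), (300, 2180.0)),  # 180s idle in 300s, idle fraction 0.6
}

if __name__ == "__main__":
    print(cpu_utilization_percent(samples))
```

With these samples the average idle fraction is 0.7, so the instance comes out at about 30 percent busy.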
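Add thresholds for visual clarity. In a panel's JSON model they sit under fieldConfig.defaults, with a base step whose value is null:
{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      {"value": null, "color": "green"},
      {"value": 80, "color": "yellow"},
      {"value": 95, "color": "red"}
    ]
  }
}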
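How a value maps onto a threshold color can be sketched in a few lines of Python. This is an illustrative model of absolute thresholds using the 80 and 95 values from the example, not Grafana's rendering code; the function name and the green base color are assumptions.

```python
def threshold_color(value, steps, base_color="green"):
    """Return the color of the highest threshold that `value` has reached."""
    color = base_color
    for step in sorted(steps, key=lambda s: s["value"]):
        if value >= step["value"]:
            color = step["color"]
    return color


STEPS = [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"},
]

if __name__ == "__main__":
    for cpu in (42.0, 80.0, 97.5):
        print(cpu, threshold_color(cpu, STEPS))
```

A reading below 80 keeps the base color, 80 up to just under 95 turns yellow, and 95 or more turns red.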
Variables and Templates
Use dashboard variables to make dashboards reusable:
Variable: $host
Type: Query
Query: label_values(node_os_version, instance)
This populates a host dropdown, and all panels use $host in their queries:
node_memory_MemAvailable_bytes{instance="$host"} / node_memory_MemTotal_bytes{instance="$host"}
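Conceptually, the variable works in two steps: a label lookup fills the dropdown, and the selected value is substituted into each panel query before it is sent to Prometheus. The Python sketch below models both steps for a single-value variable. It is an illustration only; the `label_values` and `interpolate` helpers and the sample series are hypothetical stand-ins for what Grafana does internally.

```python
from string import Template

QUERY = Template(
    'node_memory_MemAvailable_bytes{instance="$host"}'
    ' / node_memory_MemTotal_bytes{instance="$host"}'
)

# Made-up series, as label sets, standing in for what Prometheus would return.
SERIES = [
    {"__name__": "node_os_version", "instance": "web-1:9100"},
    {"__name__": "node_os_version", "instance": "db-1:9100"},
    {"__name__": "node_os_version", "instance": "web-1:9100"},
]


def label_values(series, label):
    """Distinct values of `label` across the series: the dropdown options."""
    return sorted({s[label] for s in series if label in s})


def interpolate(template, host):
    """Substitute the selected host into the panel query."""
    return template.substitute(host=host)


if __name__ == "__main__":
    options = label_values(SERIES, "instance")
    print(options)
    print(interpolate(QUERY, options[0]))
```

Multi-value variables need a regex matcher (`=~`) rather than `=`, which this single-value sketch does not cover.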
Alerting
Grafana Alerting ships with a built-in Alertmanager, so in many setups it can take the place of a standalone Alertmanager configuration. Create alert rules directly in the UI:
- Create an alert rule from any panel's query
- Set the evaluation interval and pending period (e.g., evaluate every 30s, fire once the condition has held for 5m)
- Define conditions:
  WHEN avg() OF query(A, 5m) IS ABOVE 90
- Configure contact points (Slack, PagerDuty, email)
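The interplay between the evaluation interval, the 5m pending period, and the condition can be modeled as a small state machine. The Python sketch below is an assumption-laden illustration, not Grafana's evaluator: on every tick it reduces the query window with an average, compares it with the threshold, and only moves from Pending to Alerting once the breach has lasted for the pending period. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class AlertRule:
    threshold: float = 90.0      # IS ABOVE 90
    pending_for: float = 300.0   # "for 5m", in seconds
    breaching_since: Optional[float] = None
    state: str = "Normal"

    def evaluate(self, now: float, window: Sequence[float]) -> str:
        """One evaluation tick: reduce the window with avg(), then compare."""
        if mean(window) > self.threshold:
            if self.breaching_since is None:
                self.breaching_since = now
            held = now - self.breaching_since
            self.state = "Alerting" if held >= self.pending_for else "Pending"
        else:
            # Condition cleared: reset the pending timer.
            self.breaching_since = None
            self.state = "Normal"
        return self.state


if __name__ == "__main__":
    rule = AlertRule()
    for t in range(0, 331, 30):  # evaluate every 30s
        print(t, rule.evaluate(t, [95.0, 96.0]))
```

The pending period is what keeps a short spike from paging anyone: the rule stays in Pending until the breach has persisted for the full five minutes.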
For Slack notifications, configure the contact point in Alerting → Contact points:
Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Title: {{ .CommonLabels.alertname }} — {{ .CommonLabels.instance }}
Message: {{ .CommonAnnotations.description }}
Annotations and Recording Rules
Use annotations to mark events on dashboards (deployments, config changes):
# Add annotation from CI/CD pipeline
curl -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dashboardUID":"abc123","time":'"$(date +%s)"'000,"text":"Deploy v2.1.0","tags":["deploy"]}'
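When the pipeline step is written in Python rather than shell, the same request can be assembled with the standard library. The sketch below only builds the request object and does not send it; the function name and its parameters are hypothetical, and the payload fields mirror the curl example. Note the JSON content type header and the timestamp in epoch milliseconds.

```python
import json
import time
import urllib.request


def build_annotation_request(base_url, token, dashboard_uid, text, tags, when=None):
    """Build (but do not send) a POST to Grafana's /api/annotations endpoint."""
    epoch_seconds = time.time() if when is None else when
    body = {
        "dashboardUID": dashboard_uid,
        "time": int(epoch_seconds * 1000),  # epoch milliseconds
        "text": text,
        "tags": list(tags),
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_annotation_request(
        "http://grafana:3000", "example-token", "abc123", "Deploy v2.1.0", ["deploy"]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it; separating construction from sending keeps the payload easy to test in CI.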
Prometheus recording rules pre-compute expensive queries that dashboards use frequently:
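# rules/recording.yml
groups:
  - name: node_rules
    rules:
      - record: node:cpu_utilization:avg_5m
        # Utilization is 1 minus the idle fraction; averaging the non-idle
        # modes directly would divide the busy time by the number of modes.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))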
Best Practices
- Use template variables — never hardcode host names or data sources in panels
- Keep dashboards focused — one dashboard per service or concern, not a single giant dashboard
- Set appropriate refresh intervals — 30s for operational dashboards, 5m for capacity planning views
- Document with descriptions — add panel descriptions explaining what each graph measures
- Version control — export dashboards as JSON and store them in Git
- Silence before fixing — use Grafana silences to suppress alerts during maintenance rather than disabling alert rules entirely
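For the version control practice, exported dashboard JSON tends to produce noisy diffs because some fields change between instances or on every save. The Python sketch below shows one way to normalize an export before committing it. Treating the top-level `id` and `version` fields as volatile is an assumption that may not suit every workflow, and the function name is hypothetical.

```python
import json

# Top-level fields assumed to be instance-specific or bumped on every save.
VOLATILE_KEYS = ("id", "version")


def normalize_dashboard(raw_json):
    """Strip volatile fields and emit stable, sorted JSON for clean Git diffs."""
    dashboard = json.loads(raw_json)
    for key in VOLATILE_KEYS:
        dashboard.pop(key, None)
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = '{"version": 7, "id": 12, "uid": "abc123", "title": "Nodes Overview"}'
    print(normalize_dashboard(exported), end="")
```

Two exports of the same dashboard that differ only in `id` or `version` then normalize to identical text, so Git shows a diff only when the dashboard itself has changed.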