Grafana for Infrastructure Observability: Dashboards, Alerting and Data Sources

Why Grafana

Grafana has become a de facto standard for infrastructure observability dashboards. It connects to multiple data sources, lets you query each one in its native language (PromQL for Prometheus, LogQL for Loki), and includes built-in alerting, which makes it a practical single pane of glass for operations teams.

Architecture

A typical observability stack with Grafana:

┌─────────────┐     ┌──────────┐     ┌──────────────┐
│ Prometheus  │────▶│ Grafana  │◀────│    Loki      │
│ (metrics)   │     │ Server   │     │ (logs)       │
└─────────────┘     └──────────┘     └──────────────┘
         │               │                    │
         ▼               ▼                    ▼
   Exporters on     Dashboards +          Log agents
   every host       Alerting              (promtail)

Connecting Data Sources

Start by adding data sources under Configuration → Data Sources (in newer Grafana releases the same page is under Connections → Data sources):

Prometheus (metrics):

URL: http://prometheus:9090
Scrape interval: 15s
Default: Yes

Loki (logs):

URL: http://loki:3100
Derived fields: extract traceID from log lines for trace correlation

InfluxDB or Graphite can be added alongside for legacy data.
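
The same settings can also be applied through Grafana's HTTP API instead of the UI. The following is a minimal sketch that only builds the JSON payloads one might send to POST /api/datasources; the function names, the trace-ID regex, and the "tempo" tracing data source UID are illustrative assumptions, not part of the setup described above.

```python
import json


def prometheus_datasource(url: str = "http://prometheus:9090") -> dict:
    """Payload for a default Prometheus data source with a 15s scrape interval."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # the Grafana server proxies the queries
        "isDefault": True,
        "jsonData": {"timeInterval": "15s"},  # should match Prometheus' scrape interval
    }


def loki_datasource(url: str = "http://loki:3100", tracing_uid: str = "tempo") -> dict:
    """Payload for Loki with a derived field that extracts a trace ID.

    The regex and the tracing data source UID are assumptions for illustration.
    """
    return {
        "name": "Loki",
        "type": "loki",
        "url": url,
        "access": "proxy",
        "jsonData": {
            "derivedFields": [
                {
                    "name": "TraceID",
                    "matcherRegex": r"traceID=(\w+)",
                    "datasourceUid": tracing_uid,
                    "url": "${__value.raw}",
                }
            ]
        },
    }


if __name__ == "__main__":
    for payload in (prometheus_datasource(), loki_datasource()):
        print(json.dumps(payload, indent=2))
```

Keeping the payloads in code makes the data source setup reviewable and repeatable across environments; adjust the regex to whatever format your log lines actually use.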

Building Effective Dashboards

Dashboard Structure

Organize dashboards by service layer:

/Infrastructure/
├── Nodes Overview     — CPU, memory, disk, network across all hosts
├── Kubernetes Cluster — Pod health, resource usage, node status
└── Network Latency    — Inter-service latency, packet loss

/Services/
├── Web Server Farm    — Request rate, error rate, response times
├── Database Cluster   — Connections, query latency, replication lag
└── Message Queue      — Queue depth, consumer lag

Panels and Queries

A typical node CPU panel using PromQL:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
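
To see what this query computes, the arithmetic can be reproduced by hand. The sketch below is only an illustration with made-up counter samples (no counter resets, one instance); the function name is hypothetical and it is not how Prometheus evaluates rate() internally.

```python
def cpu_utilization_percent(idle_start: list[float], idle_end: list[float],
                            window_seconds: float) -> float:
    """Mimic 100 - avg(rate(idle[window])) * 100 for a single instance.

    idle_start / idle_end hold each core's idle-seconds counter at the start
    and end of the window (hypothetical sample values).
    """
    if len(idle_start) != len(idle_end) or not idle_start:
        raise ValueError("need one start and one end sample per core")
    # rate(): per-second increase of each core's idle counter (0.0 to 1.0)
    idle_rates = [(end - start) / window_seconds
                  for start, end in zip(idle_start, idle_end)]
    # avg by(instance): mean idle fraction across all cores of the host
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 5m (300s) window: one idle for 225s, the other for 75s.
print(cpu_utilization_percent([1000.0, 2000.0], [1225.0, 2075.0], 300))  # 50.0
```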

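Add thresholds for visual clarity. In the panel JSON they live under fieldConfig.defaults, and the first step (value null) sets the base color:

{
  "thresholds": {
    "mode": "absolute",
    "steps": [
      {"value": null, "color": "green"},
      {"value": 80, "color": "yellow"},
      {"value": 95, "color": "red"}
    ]
  }
}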
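
How thresholds translate into colors can be sketched in a few lines. This is a simplified model with a hypothetical helper name, assuming absolute thresholds at 80 and 95 and a green base color for values below the first threshold.

```python
# Thresholds from the panel above; the green base color is an assumption.
STEPS = [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"},
]


def color_for(value: float, steps: list[dict] = STEPS, base_color: str = "green") -> str:
    """Return the color of the highest threshold that the value has reached."""
    color = base_color
    for step in sorted(steps, key=lambda s: s["value"]):
        if value >= step["value"]:
            color = step["color"]
    return color


for sample in (42.0, 80.0, 97.5):
    print(sample, color_for(sample))
```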

Variables and Templates

Use dashboard variables to make dashboards reusable:

Variable: $host
Type: Query
Query: label_values(node_os_version, instance)

This populates a host dropdown, and all panels use $host in their queries:

node_memory_MemAvailable_bytes{instance="$host"} / node_memory_MemTotal_bytes{instance="$host"}
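
Conceptually, Grafana substitutes the selected value into the query text before sending it to Prometheus. The sketch below imitates that for a single-value variable using Python's string.Template; the host value is a made-up example, and Grafana's real interpolation also handles multi-value variables and escaping, which this does not.

```python
from string import Template

# Same query as above, with $host as the placeholder.
MEMORY_QUERY = Template(
    'node_memory_MemAvailable_bytes{instance="$host"} '
    '/ node_memory_MemTotal_bytes{instance="$host"}'
)


def render_query(host: str) -> str:
    """Replace every $host occurrence with the selected dropdown value."""
    return MEMORY_QUERY.substitute(host=host)


print(render_query("web-01:9100"))
```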

Alerting

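Grafana Alerting can take the place of a standalone Alertmanager configuration: alert rules, contact points, and notification routing are managed in Grafana, which ships with its own built-in Alertmanager. Create alert rules directly in the UI: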

Create alert rule from any panel's query
Set the evaluation behavior (e.g., evaluate every 30s with a 5m pending period, so the condition must hold for 5m before the alert fires)
Define conditions: WHEN avg() OF query(A, 5m) IS ABOVE 90
Configure contact points (Slack, PagerDuty, email) and a notification policy that routes alerts to them
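
The interplay of evaluation interval and "for" duration in the steps above can be sketched as a small state machine. This is a simplified model with a hypothetical class name: it covers only the Normal, Pending, and Alerting states and ignores NoData and Error handling.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    threshold: float = 90.0   # IS ABOVE 90
    interval_s: int = 30      # evaluate every 30s
    pending_s: int = 300      # "for 5m" before the alert fires
    _breach_age_s: int = -1   # -1 means the condition is currently not breached

    def evaluate(self, value: float) -> str:
        """Process one evaluation tick and return the resulting state."""
        if value <= self.threshold:
            self._breach_age_s = -1
            return "Normal"
        # The first breaching tick starts the pending clock at 0s.
        if self._breach_age_s < 0:
            self._breach_age_s = 0
        else:
            self._breach_age_s += self.interval_s
        return "Alerting" if self._breach_age_s >= self.pending_s else "Pending"


rule = AlertRule()
print([rule.evaluate(95.0) for _ in range(11)])  # Pending x10, then Alerting
```

A single evaluation below the threshold resets the clock, which is why short spikes never leave the Pending state.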

For Slack notifications, configure the contact point in Alerting → Contact points:

Type: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK
Title: {{ .CommonLabels.alertname }} on {{ .CommonLabels.instance }}
Message: {{ .CommonAnnotations.description }}

Annotations and Recording Rules

Use annotations to mark events on dashboards (deployments, config changes):

# Add annotation from CI/CD pipeline (time is in epoch milliseconds)
curl -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dashboardUID":"abc123","time":'"$(date +%s)"'000,"text":"Deploy v2.1.0","tags":["deploy"]}'
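
If the pipeline step is written in Python rather than shell, the same call might look like the sketch below. It only builds the request so it can be inspected or tested; the function name and the example token are hypothetical, and the dashboard UID is the placeholder from the curl example.

```python
import json
import time
import urllib.request


def build_annotation_request(base_url, token, dashboard_uid, text, tags, when_s=None):
    """Build (but do not send) a POST /api/annotations request."""
    timestamp_s = when_s if when_s is not None else time.time()
    payload = {
        "dashboardUID": dashboard_uid,
        "time": int(timestamp_s * 1000),  # Grafana expects epoch milliseconds
        "text": text,
        "tags": list(tags),
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_annotation_request(
        "http://grafana:3000", "example-token", "abc123", "Deploy v2.1.0", ["deploy"]
    )
    print(req.full_url, req.data.decode())
    # To actually send it from a pipeline step:
    # urllib.request.urlopen(req, timeout=10)
```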

Prometheus recording rules pre-compute expensive queries that dashboards use frequently:

# rules/recording.yml
groups:
  - name: node_rules
    rules:
      - record: node:cpu_utilization:avg_5m
        # Fraction of CPU time not spent idle (0 to 1), averaged across cores
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Best Practices

Use template variables: never hardcode host names or data sources in panels
Keep dashboards focused: one dashboard per service or concern, not a single giant dashboard
Set appropriate refresh intervals: 30s for operational dashboards, 5m for capacity planning views
Document with descriptions: add panel descriptions explaining what each graph measures
Version control: export dashboards as JSON and store them in Git
Silence, don't disable: use Grafana silences to suppress alerts during maintenance rather than disabling alert rules entirely
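
For the version-control practice, exported dashboard JSON tends to produce noisy diffs unless it is normalized first. The sketch below shows one possible approach; the helper name is hypothetical, and treating "id" and "version" as instance-specific fields to drop is an assumption you may want to adjust.

```python
import json

# Top-level fields assumed to differ between Grafana instances and saves.
VOLATILE_KEYS = ("id", "version")


def normalize_dashboard(dashboard: dict) -> str:
    """Return a stable, diff-friendly JSON string for committing to Git."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    return json.dumps(cleaned, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    exported = {"id": 7, "version": 3, "uid": "abc123", "title": "Nodes Overview"}
    print(normalize_dashboard(exported), end="")
```

Sorting keys and dropping the volatile fields means two exports of an unchanged dashboard produce identical files, so Git only shows real changes.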