Nagios Production Monitoring: Setup, Configuration and Best Practices

Why Nagios

Nagios Core remains one of the most widely deployed monitoring frameworks in production environments. Its plugin architecture, passive check support, and distributed monitoring capabilities make it a solid choice for organizations that need full control over their monitoring stack without vendor lock-in.

Core Components

A typical Nagios deployment consists of:

Nagios Core — the monitoring engine that schedules checks, processes results, and triggers alerts
NRPE (Nagios Remote Plugin Executor) — runs checks on remote hosts and returns results
NSCA (Nagios Service Check Acceptor) — receives passive check results from remote systems
Check plugins — the actual scripts that perform checks (disk, CPU, load, processes, custom)
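
Every check plugin, stock or custom, reports through the same contract: one line of status text on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A minimal sketch of a custom check is shown here; the function name check_usage and its thresholds are illustrative and not part of any shipped plugin.

```shell
# Hypothetical custom check: compare a usage percentage against thresholds.
# Prints one status line (text | perfdata) and returns the plugin exit code.
check_usage() {
    value=$1 warn=$2 crit=$3
    case $value in
        ''|*[!0-9]*)
            # Anything that is not a plain integer cannot be evaluated
            echo "UNKNOWN - usage value '$value' is not an integer"
            return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - usage at ${value}% | usage=${value}%;${warn};${crit}"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - usage at ${value}% | usage=${value}%;${warn};${crit}"
        return 1
    fi
    echo "OK - usage at ${value}% | usage=${value}%;${warn};${crit}"
    return 0
}
```

In a real plugin script, the last lines would call the function with a measured value and then `exit $?`, so that Nagios sees the exit code.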

Basic Server Setup

Install Nagios Core and the standard plugins:

apt install nagios4 nagios-plugins-contrib nagios-nrpe-plugin

The main configuration lives in /etc/nagios4/:

File	Purpose
`nagios.cfg`	Main daemon configuration
`conf.d/hosts.cfg`	Host definitions
`conf.d/services.cfg`	Service definitions
`conf.d/contacts.cfg`	Alert recipients
`objects/commands.cfg`	Check command definitions

Adding Hosts and Services

Define each monitored server in conf.d/hosts.cfg:

define host {
    use                     linux-server
    host_name               web-01
    alias                   Production Web Server
    address                 10.0.1.10
    check_command           check-host-alive
    max_check_attempts      3
    notification_interval   120
}

Define which services to check on that host in conf.d/services.cfg:

define service {
    use                 generic-service
    host_name           web-01
    service_description HTTP Response
    check_command       check_http!-H 10.0.1.10 -u /health -w 3 -c 5
    check_interval      5
    retry_interval      1
}
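
The two interval settings combine with max_check_attempts to determine how long an outage can last before anyone is notified. As a rough sketch, assume the default interval_length of 60 seconds (so intervals are in minutes) and a max_check_attempts that resolves to 3, and ignore check execution time and scheduling latency. A failure that begins just after a successful check is first seen one check_interval later, then rechecked every retry_interval until the attempt limit is reached. The helper name below is hypothetical.

```shell
# Hypothetical helper: upper bound, in interval units, between the start of
# an outage and the hard state that triggers a notification.
worst_case_alert_delay() {
    check_interval=$1 retry_interval=$2 max_attempts=$3
    # One full check interval until the first failed check, then
    # (max_attempts - 1) retries spaced retry_interval apart.
    echo $(( check_interval + (max_attempts - 1) * retry_interval ))
}

worst_case_alert_delay 5 1 3   # prints 7
```

With the values in the service definition above, the worst case is roughly 7 minutes; lowering check_interval to 1 brings it down to about 3.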

Alert Configuration

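Nagios sends notifications via the notify-host-by-email and notify-service-by-email commands. Define recipients in conf.d/contacts.cfg; a contact only receives alerts for hosts and services that list it under contacts or through one of its contact_groups: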

define contact {
    contact_name                    ops-team
    use                             generic-contact   ; sample-config template: supplies notification periods and options
    alias                           Operations Team
    email                           ops@example.com
    service_notification_commands   notify-service-by-email
    host_notification_commands      notify-host-by-email
}

For Slack integration, point the contact's notification commands at a script that POSTs to a Slack incoming webhook:

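#!/bin/bash
# /usr/lib/nagios/plugins/notify-slack
# Nagios exports macros to scripts as NAGIOS_* environment variables,
# and only when enable_environment_macros=1 is set in nagios.cfg.
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK"
MESSAGE="Nagios: $NAGIOS_NOTIFICATIONTYPE - $NAGIOS_HOSTALIAS/$NAGIOS_SERVICEDESC is $NAGIOS_SERVICESTATE"
# Post the message to the Slack incoming webhook
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"$MESSAGE\"}" "$WEBHOOK_URL"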
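
A notification script is easier to trust if it can be exercised without waiting for a real outage. The sketch below separates payload construction from the POST so the JSON can be inspected locally. It assumes the notification details arrive as NAGIOS_* environment variables, which Nagios exports only when enable_environment_macros=1 is set in nagios.cfg; the function name slack_payload is hypothetical.

```shell
# Hypothetical helper: build the Slack JSON body from NAGIOS_* variables.
# Host notifications carry no service description, so fall back to host state.
slack_payload() {
    if [ -n "${NAGIOS_SERVICEDESC:-}" ]; then
        subject="${NAGIOS_HOSTALIAS:-}/${NAGIOS_SERVICEDESC}"
        state="${NAGIOS_SERVICESTATE:-}"
    else
        subject="${NAGIOS_HOSTALIAS:-}"
        state="${NAGIOS_HOSTSTATE:-}"
    fi
    message="Nagios: ${NAGIOS_NOTIFICATIONTYPE:-} - ${subject} is ${state}"
    # Escape backslashes and double quotes so the JSON stays valid.
    escaped=$(printf '%s' "$message" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"text":"%s"}\n' "$escaped"
}

# Dry run: simulate the environment Nagios would provide and print the payload.
NAGIOS_NOTIFICATIONTYPE=PROBLEM NAGIOS_HOSTALIAS="Production Web Server" \
NAGIOS_SERVICEDESC="HTTP Response" NAGIOS_SERVICESTATE=CRITICAL slack_payload
```

In the real script, the output would be handed to curl, for example `curl -s -X POST -H 'Content-type: application/json' --data "$(slack_payload)" "$WEBHOOK_URL"`.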

Distributed Monitoring with NSCA

For large environments, use a distributed setup: satellite Nagios instances perform local checks and forward results to a central server via NSCA.

On the central server (receiver):

apt install nsca
# configure /etc/nsca.cfg with server_address and password
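
NSCA hands incoming results to Nagios through the external command file, so the central nagios.cfg has to allow external commands and passive service checks. A small sketch for confirming both settings before debugging anything else follows; the helper names cfg_value and check_passive_prereqs are hypothetical.

```shell
# Hypothetical helper: print the value of a key=value directive in a
# Nagios main config file. Commented-out lines are ignored; if the key
# appears more than once, the last occurrence is printed.
cfg_value() {
    file=$1 key=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Report the two directives the central server needs for NSCA results.
check_passive_prereqs() {
    file=$1
    for key in check_external_commands accept_passive_service_checks; do
        if [ "$(cfg_value "$file" "$key")" = "1" ]; then
            echo "ok: $key=1"
        else
            echo "missing: $key=1"
        fi
    done
}
```

On the Debian layout used in this article, this would be run as `check_passive_prereqs /etc/nagios4/nagios.cfg`.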

On each remote satellite:

# After local check completes, send result to central.
# send_nsca expects real tab characters between fields; a heredoc would pass "\t" literally.
printf 'web-01\tHTTP Response\t0\tHTTP OK - 0.142s response\n' | \
  /usr/bin/send_nsca -H central.example.com -c /etc/send_nsca.cfg
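
On a satellite, the usual pattern is a small wrapper that runs the local plugin, captures its exit code and first line of output, and forwards both in the tab-separated format send_nsca expects (host, service description, return code, output). A sketch under those assumptions is shown here; the function name forward_check and the SEND_CMD override are hypothetical and exist only to allow a dry run without a central server.

```shell
# Hypothetical wrapper: run a plugin locally and forward its result via NSCA.
# Usage: forward_check <host_name> <service_description> <plugin> [args...]
forward_check() {
    host=$1 service=$2
    shift 2
    rc=0
    # Capture plugin output and remember its exit code (0/1/2/3)
    output=$("$@" 2>&1) || rc=$?
    # Keep only the first line of plugin output
    output=$(printf '%s\n' "$output" | head -n 1)
    # SEND_CMD is an assumption of this sketch; by default the result is
    # piped to send_nsca with the same options used on the satellites.
    printf '%s\t%s\t%s\t%s\n' "$host" "$service" "$rc" "$output" |
        ${SEND_CMD:-/usr/bin/send_nsca -H central.example.com -c /etc/send_nsca.cfg}
}
```

A dry run such as `SEND_CMD=cat forward_check web-01 "HTTP Response" /usr/lib/nagios/plugins/check_http -H 10.0.1.10 -u /health -w 3 -c 5` prints the line that would be sent instead of transmitting it.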

Best Practices

Check intervals: Use 5-minute intervals for standard checks, 1-minute for critical services. Avoid sub-minute checks unless absolutely necessary.
Max check attempts: Set to at least 3 so that a brief, self-correcting failure is rechecked before it triggers an alert.
Dependencies: Define service dependencies so that one root failure doesn't produce a cascade of alerts (e.g., if a database is down, suppress alerts for the app that depends on it).
Passive checks: Prefer passive checks for services that are expensive to probe (log analysis, complex application health).
Configuration management: Store Nagios configs in Git and deploy with Ansible or a similar tool.
Log rotation: Monitor the size of the Nagios logs and spool directories; both can grow quickly in large deployments.
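
For the configuration-management practice above, a common safeguard is to run the built-in pre-flight check before reloading, so that a broken commit never takes the daemon down. This is a minimal sketch, assuming the Debian nagios4 binary and a systemd unit of the same name that supports reload; the function name and the NAGIOS_BIN, NAGIOS_CFG, and RELOAD_CMD overrides are hypothetical.

```shell
# Hypothetical deploy step: reload Nagios only if the configuration verifies.
deploy_if_valid() {
    nagios_bin=${NAGIOS_BIN:-nagios4}
    cfg=${NAGIOS_CFG:-/etc/nagios4/nagios.cfg}
    # -v verifies the whole configuration without starting the daemon
    if "$nagios_bin" -v "$cfg" >/dev/null 2>&1; then
        ${RELOAD_CMD:-systemctl reload nagios4}
    else
        echo "config check failed for $cfg; not reloading" >&2
        return 1
    fi
}
```

Called as the last step of an Ansible play or a Git hook, the function leaves the running daemon untouched whenever verification fails.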