Nagios Howto: Notification Escalations, EventHandlers & Remote Service Monitoring With NRPE

From the Nagios website, "Nagios is an open source host, service and network monitoring program." However, this description is not as accurate as it could be. In its most basic state of configuration, Nagios is a fantastic tool for locally monitoring network devices and services.

Once you step past the most basic installation and configuration, Nagios transforms your systems by providing service outage notification escalations and self-healing network services with the use of the ServiceEscalation and EventHandler configuration objects. This, combined with the ability to monitor unlimited remote hosts and services, makes Nagios an irreplaceable tool for any systems or network administrator.

This whitepaper is intended to instruct you on the implementation of the more advanced configuration features available in Nagios, including EventHandlers, NotificationEscalations, and monitoring of remote network devices and services using NRPE. This paper assumes that you have a working Nagios installation and at least one remote device or service to monitor. Also, for the purpose of this work, we’ll focus on the Linux operating system.

Before we jump right in, if you haven’t already done so, this is an excellent opportunity to organize your configuration files, as we’ll be jumping around from section to section. I recommend that you create a separate configuration file for each object type in your object configuration file and store these files in the nagios/etc directory. Simply include each of these configuration files in your primary nagios.cfg file, in no particular order. The include directive looks like this:

cfg_file=/usr/local/nagios/etc/timeperiods.cfg

And the ‘timeperiods.cfg’ file would look something like this:

define timeperiod{
    timeperiod_name 24x7
    alias 24 Hours A Day, 7 Days A Week
    sunday 00:00-24:00
    monday 00:00-24:00
    tuesday 00:00-24:00
    wednesday 00:00-24:00
    thursday 00:00-24:00
    friday 00:00-24:00
    saturday 00:00-24:00
}

Using this type of organization for your configuration files will save you a lot of time down the line, so it’s best to handle this organization task early on if you have not already done so.
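
As a rough sketch, the relevant part of a nagios.cfg organized this way might look like the following. Only timeperiods.cfg and serviceescalations.cfg are named in this paper; the other file names are illustrative assumptions, so substitute whatever names you actually use:

# One cfg_file line per object type (order does not matter)
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
# Hypothetical file names below; adjust to your own layout
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg

After splitting the files, it is worth running Nagios in verify mode before restarting it, for example with /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg (the binary path assumes a default source install).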

Notification Escalations

Here’s the scenario that we will implement. You will come away with the knowledge to build upon it and refine the Notification Escalations to your needs.

The service we will focus on is POP3 on the local machine, but the configuration procedure is the same for any service. We’ll cover remote monitoring and escalations later in this paper. For now, we have a Linux server at Crucial Web Hosting running the POP3 service daemon in a web hosting environment that requires 24/7 operation and outage notification. We need to create two escalation levels in addition to the default notification for the service.

We will check our POP3 service for an OK state once every 3 minutes. If we receive anything other than an OK state, we want the available technical administrators to be notified via email immediately on the first failure. We then begin checking the POP3 service once a minute for an OK state. If no OK state is received after an additional 10 minutes, we then "escalate" the notification of this service problem. This is our first escalation. We are now approximately 13 minutes into the service failure. Our available system administrators have been notified twice and the service is still down.
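
The default service definition behind this schedule is not shown in this paper, so the following is only a sketch of what it might look like. The check_pop command name, the max_check_attempts value, and the notification_options are assumptions; the 3 minute and 1 minute check intervals and the 10 minute notification interval come from the scenario above:

# Sketch of the default POP3 service (assumed, not Crucial's actual file)
define service{
    host_name host-xx.crucialwebhost.com
    service_description POP3
    check_command check_pop            ; assumes a check_pop command is defined
    check_period 24x7
    max_check_attempts 1               ; assumption: notify on the first failure
    normal_check_interval 3            ; check every 3 minutes while OK
    retry_check_interval 1             ; recheck every minute after a failure
    notification_interval 10           ; default re-notification every 10 minutes
    notification_period 24x7
    notification_options w,u,c,r       ; assumption: all problem states and recoveries
    contact_groups admins
}

Nagios applies the retry interval while a problem is still being confirmed, so the exact recheck cadence after the first notification depends on max_check_attempts. Treat these values as a starting point and confirm the behavior against the manual for your Nagios version.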

At this time we want to escalate this service outage so our Level 3 Crucial technicians know about it and someone can get the service problem repaired. This is where our service escalation configuration object takes over and we get into the examples.

In our serviceescalations.cfg file we have the following two service notification escalations:

# First Escalation
define serviceescalation{
    host_name host-xx.crucialwebhost.com
    service_description POP3
    first_notification 2
    last_notification 3
    notification_interval 30
    contact_groups admins,level-three
}

# Second Escalation
define serviceescalation{
    host_name host-xx.crucialwebhost.com
    service_description POP3
    first_notification 3
    last_notification 10
    notification_interval 60
    contact_groups admins,level-three,pagers
}

In the first service escalation, first_notification is set to 2. This simply means that this configuration object takes over from the default service notification starting with the second notification that is sent out. Because last_notification is set to 3, this escalation handles both the second and third notifications. Also, notice that notification_interval has been greatly increased. This setting also overrides the default service notification configuration, so instead of 10-minute notification intervals, we now have 30-minute notification intervals.

The reason is simple and can change with your environment. Crucial Web Hosting has the contact group level-three configured to open a Level 3 trouble ticket in our support center. This notifies all Crucial administrative personnel of the service problem, allowing even our Level 1 techs to immediately become aware of even a POP3 service outage on any server. We don’t want to flood our support center ticket system with tickets, so we open a Level 3 ticket and then one more 30 minutes later if the service problem still exists.

We are now 13 minutes into our service outage, and the Level 3 technicians are aware of it thanks to the creation of the Level 3 support ticket. Our Level 1 techs are also aware in the support center and are able to assist clients immediately. If an additional 30 minutes pass, the third service notification goes out, which creates a second trouble ticket in Level 3 support. At this point we are either under attack or all of our techs and admins have died.

We are now 43 minutes into the POP3 service outage. At this time we need to page the Crucial Chief Technical Officer (CTO) so that he may coordinate a solution. Let’s look at that second escalation more closely. Notice that first_notification is set to 3. This is the same as the first escalation’s last_notification. This is an example of overlapping escalations, an excellent way to have more than one escalation active at one time.

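We also have last_notification set to 10 and notification_interval set to 60, which means the CTO will be paged about once an hour, for roughly six and a half more hours, until the tenth notification is sent. (Where escalations overlap, Nagios uses the smallest notification interval among them, so the fourth notification follows the third after 30 minutes rather than 60.) After the tenth notification, the default service notification to the contact group admins takes over again.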

It’s very important to notice that even though admins is the default and lowest contact_group on the escalation ladder, we continue to list admins as a contact_group in each escalation. We do this so that those who have already been notified continue to know the status of the outage, and so that in the event of a recovery all parties involved in the escalation are notified that the service has recovered.
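
The three contact groups used in the escalations have to exist before Nagios will accept the configuration. A minimal sketch follows; the group names come from the escalations above, while the aliases and member names are hypothetical placeholders, and each member must match a contact_name defined in your own contact definitions:

# Default group, notified first and kept on every escalation
define contactgroup{
    contactgroup_name admins
    alias Staff Administrators
    members admin-one,admin-two        ; hypothetical contacts
}

# Group whose contact opens a Level 3 trouble ticket
define contactgroup{
    contactgroup_name level-three
    alias Level 3 Support
    members level3-ticket              ; hypothetical contact
}

# Group paged by the second escalation
define contactgroup{
    contactgroup_name pagers
    alias On-Call Pagers
    members cto-pager                  ; hypothetical contact
}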

As you can see the ServiceEscalation object in Nagios is a very powerful tool. In only 3 minutes we are able to notify our staff admins. At 13 minutes we create a Level 3 technical support ticket, and at 43 minutes the CTO becomes involved in the problem. Larger networks will require much more detailed escalation configurations and all escalations for all services will need to be tuned to your particular environment.

The examples given here are primitive for the purpose of demonstration. We highly suggest that you read the Nagios manual for all configuration variables available to each object. Your notification and escalation needs will be different than those of Crucial, so it is important to think out your notification strategy prior to implementing the above changes as you could quickly become annoyed at false alarms. At Crucial, a false alarm is alarming.
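
As one illustration of those additional variables, the serviceescalation object also accepts escalation_period and escalation_options. The sketch below is an assumption about how the second escalation could be narrowed, not part of the Crucial configuration: it only escalates during the 24x7 time period defined earlier and only for CRITICAL states and recoveries, which is one way to keep WARNING-level noise away from the pager:

# Second Escalation, restricted by period and state (illustrative only)
define serviceescalation{
    host_name host-xx.crucialwebhost.com
    service_description POP3
    first_notification 3
    last_notification 10
    notification_interval 60
    escalation_period 24x7             ; escalate only while this time period is active
    escalation_options c,r             ; escalate only on CRITICAL and recovery
    contact_groups admins,level-three,pagers
}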

For further information on Nagios and Service Escalations, please see links in the Reference section. If you haven’t already done so and you would like to install Nagios to follow along with this paper, now is a good time to install it and catch up. Soon, I’ll be posting the next section to this paper which will cover the monitoring of remote systems and services using the NRPE plugin. Following the NRPE paper, I’ll be posting a paper on EventHandlers—and when your server starts repairing itself, this will all be worth it.

I look forward to your comments and thank you for choosing Crucial.

Reference

  1. Nagios Software
  2. Nagios Manual
  3. Notification Escalations
