monitoring critical servers - best practices

alfonso baldaserra alfonso.baldaserra at gmail.com
Thu Apr 16 14:02:48 CEST 2009


Greetings,

We are using Nagios version 3.0.6 on Fedora Core 9.

I am looking for ideas on how you monitor critical servers and services,
and what the best practices are.

On a related note, I just realized we have been missing a lot of alerts
lately.  Today we had to reboot a couple of AIX servers, which usually takes
5+ minutes.  The interesting thing is that we did not receive any
notifications for these servers.  Below is the host configuration entry:

define host{
        name                     aix-server       ; The name of this host template
        use                      generic-host     ; Inherit other values from the generic-host template
        check_period             24x7             ; AIX hosts are checked round the clock
        check_interval           2                ; Actively check the host every 2 minutes
        retry_interval           1                ; Schedule host check retries at 1 minute intervals
        max_check_attempts       2                ; Check each AIX host 2 times (max)
        check_command            check-host-alive ; Default command to check AIX hosts
        notification_interval    10               ; Resend notifications every 10 minutes
        notification_options     d,u,r            ; Only send notifications for specific host states
        contact_groups           aix-team         ; Notifications get sent to the aix-team contact group
        register                 0                ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
        }
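
For context, a real host would inherit from this template roughly as
follows.  The host name, alias and address are placeholders made up for
illustration, not one of our actual servers:

define host{
        use                      aix-server       ; Inherit everything from the template above
        host_name                aix-example01    ; Hypothetical host name
        alias                    Example AIX box  ; Hypothetical description
        address                  192.0.2.10       ; Placeholder address from the documentation range
        }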

What do I need to change to get the following behaviour?

a server goes down
Nagios runs its usual check after 1 minute and finds the server is down
Nagios checks again a minute later and finds the server is still down
Nagios sends a notification and keeps sending one every 10 minutes
until the server comes up again
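
My best guess at settings that would produce the sequence above is sketched
here.  I have not verified it, and it assumes interval_length is left at its
default of 60 seconds so that the intervals are in minutes.  The
notifications_enabled and notification_period lines are my own additions, in
case generic-host does not already set them:

define host{
        name                     aix-server       ; Same template name as before
        use                      generic-host     ; Still inherit from generic-host
        check_period             24x7
        check_interval           1                ; Regular check every minute
        retry_interval           1                ; Recheck 1 minute after a failed check
        max_check_attempts       2                ; The second failed check makes DOWN a hard state
        check_command            check-host-alive
        notifications_enabled    1                ; Assumption: may already be inherited
        notification_period      24x7             ; Assumption: may already be inherited
        notification_interval    10               ; Repeat the notification every 10 minutes
        notification_options     d,u,r
        contact_groups           aix-team
        register                 0                ; Template only, not a real host
        }

If that reading is right, the only timing change from our current template
is check_interval going from 2 to 1.  Since even the current values should
notify well within a 5 minute outage, I suspect the missing alerts have
another cause.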

I have searched the Nagios archives for check_interval, retry_interval and
max_check_attempts, and came away totally confused.

Any help is much appreciated.

P.S.  I would ask the Nagios developers to either rename these options to
something more meaningful or provide some real-life examples.  Judging by
the archives, many users have been confused by them.

Thanks