Building a reliable uptime monitoring model

Kumar, Ashish xml.devel at gmail.com
Tue Mar 20 16:58:58 CET 2012


Greetings of the day,

We are trying to find a reliable uptime monitoring solution.  Sometimes
a server reboots so quickly, within the window allowed by check_interval
and max_check_attempts, that Nagios never generates an alert.  This
sometimes causes chaos and makes people lose faith in Nagios (no flame
wars, please).
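To make the timing problem concrete, here is a rough worst-case model as a
small Python sketch.  It is a simplification (an assumption, not the exact
Nagios scheduler): the host goes down right after a successful check, the
first failing check comes one full check_interval later, and
max_check_attempts - 1 further retries at retry_interval are needed before
a HARD DOWN state, which is what triggers a notification.  Intervals are
in seconds here; in the Nagios config they are multiples of interval_length
(60 seconds by default).  The function names are made up for illustration.

```python
def worst_case_detection_s(check_interval_s, retry_interval_s, max_check_attempts):
    """Outage length (seconds) at or below which a host can recover before
    reaching a HARD DOWN state, under the simplified worst-case model:
    the host fails just after a successful check, the first failing check
    is one check_interval later, then (max_check_attempts - 1) retries
    follow, spaced retry_interval apart."""
    return check_interval_s + (max_check_attempts - 1) * retry_interval_s


def reboot_may_go_unnoticed(downtime_s, check_interval_s, retry_interval_s, max_check_attempts):
    """True if an outage of downtime_s seconds could end before a HARD state."""
    # The last required failing check fires at worst_case_detection_s; if the
    # host is already back by then, that check succeeds and no HARD state
    # (and therefore no notification) is produced.
    return downtime_s <= worst_case_detection_s(
        check_interval_s, retry_interval_s, max_check_attempts
    )


# Example: 5-minute check_interval, 1-minute retry_interval, 3 attempts.
print(worst_case_detection_s(300, 60, 3))       # 420
print(reboot_may_go_unnoticed(90, 300, 60, 3))  # True: a 90 s reboot can slip through
```

In other words, with these settings any reboot shorter than about seven
minutes can, with unlucky timing, finish entirely in SOFT states.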

We have tried different solutions over the months and here are some
findings.

SNMP traps sound good, but they have their own drawbacks and add
complexity, so no thanks:
http://nagios.frank4dd.com/howto/windows-reboot-monitoring-nagios.htm

SNMP is out of the question.  A good read for people relying on net-snmp for
uptime:
http://www.mail-archive.com/net-snmp-users@lists.sourceforge.net/msg27570.html

We rolled out NRPE for uptime and other monitoring requirements.  NRPE is
awesome, but to avoid raising too many alerts we made the uptime checks
dependent on the NRPE connection check.  This creates its own problems
while the server is rebooting.  On a rainy day there are two alerts: NRPE
connection refused, and then host down, because the server takes a while
to shut down all its services before going down itself.  On a snowy day
there are three alerts: NRPE connection refused, then host down, and later
uptime below the threshold of (n) minutes.
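For reference, the "uptime below (n) minutes" check mentioned above can be
as small as the following sketch.  It assumes a Linux host (it reads
/proc/uptime), is meant to be run through NRPE, and uses the standard
plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN); the function names and the
30-minute default are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def read_uptime_s(path="/proc/uptime"):
    """Return seconds since boot from the first field of Linux /proc/uptime."""
    with open(path) as f:
        return float(f.read().split()[0])


def evaluate(uptime_s, crit_minutes):
    """Map an uptime in seconds to a Nagios (exit_code, status_line) pair."""
    minutes = uptime_s / 60.0
    if minutes < crit_minutes:
        return CRITICAL, (
            f"UPTIME CRITICAL - up {minutes:.1f} min "
            f"(< {crit_minutes} min), recent reboot"
        )
    return OK, f"UPTIME OK - up {minutes:.1f} min"


def main(argv):
    """Plugin entry point: argv[1] is the threshold in minutes (default 30)."""
    try:
        crit_minutes = int(argv[1]) if len(argv) > 1 else 30
        uptime_s = read_uptime_s()
    except (OSError, ValueError, IndexError) as exc:
        # Bad argument or unreadable /proc/uptime: report UNKNOWN, not OK.
        print(f"UPTIME UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate(uptime_s, crit_minutes)
    print(message)
    return code
```

To run it as a plugin, add `import sys` and `sys.exit(main(sys.argv))` at
the end of the file.  The threshold has to be longer than the uptime
service's own worst-case detection window (its check_interval plus
retries) plus the time NRPE needs to come back after boot; otherwise the
CRITICAL result can clear before it ever becomes a HARD state, which is the
same blind spot described at the top of this mail.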

SSH checks, like all of the above, are prone to failure when the server is
under heavy load and not answering external requests; I am sure most of us
have witnessed that.

So I was wondering how everyone else reliably detects server reboots and
notifies the intended audience with a high rate of success.

Can we please use this thread to develop a robust uptime check model, if
there isn't one already?
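One possible direction, offered as a sketch rather than a proven answer:
instead of asking "is uptime below (n) minutes?", record the boot time on
every check and alert whenever it changes.  Because the stored value
survives the reboot, the change is noticed at the first successful check
after NRPE comes back, no matter how fast the reboot was or how many checks
were missed in between.  The sketch assumes Linux (the btime line of
/proc/stat); the function names, the state file, and the 5-second
tolerance for clock adjustments are all hypothetical.

```python
import json
import os

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def read_btime(path="/proc/stat"):
    """Return the Linux boot time (seconds since the epoch) from the 'btime' line."""
    with open(path) as f:
        for line in f:
            if line.startswith("btime "):
                return int(line.split()[1])
    raise ValueError(f"no btime line in {path}")


def check_reboot(current_btime, state_path, tolerance_s=5):
    """Compare current_btime with the value saved by the previous run.

    Returns (exit_code, status_line) and always saves current_btime for the
    next run.  tolerance_s (an assumed value) absorbs small btime drift
    caused by clock adjustments.
    """
    previous = None
    if os.path.exists(state_path):
        with open(state_path) as f:
            previous = json.load(f).get("btime")
    with open(state_path, "w") as f:
        json.dump({"btime": current_btime}, f)

    if previous is None:
        return OK, f"REBOOT OK - boot time {current_btime} recorded (first run)"
    if abs(current_btime - previous) > tolerance_s:
        return CRITICAL, (
            f"REBOOT CRITICAL - boot time changed from {previous} to {current_btime}"
        )
    return OK, f"REBOOT OK - no reboot since boot time {current_btime}"
```

Wired up as `check_reboot(read_btime(), state_path)`.  Two caveats: the
state file must live somewhere that survives a reboot (for example a
hypothetical /var/lib/nagios/last_btime.json, not /tmp, which many systems
clear at boot), and since only one run reports CRITICAL, the service would
need max_check_attempts set to 1 so that single result is a HARD state and
actually notifies.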

Many thanks for your time.

Regards,
Ashish Kumar
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users