monitoring critical servers - best practices

Sean McAfee smcafee at collaborativefusion.com
Wed May 6 15:58:11 CEST 2009


alfonso baldaserra wrote:
> i had been waiting for people to share their experience monitoring 
> mission critical systems but it seems there are not many people who do 
> that. 

I sure am, and I'd be willing to bet that a lot of people here are. 
You've got everything "right" in your configs.  You likely missed alerts 
because of a queue backup, which is usually caused by trying to run too 
many checks.

Every time a service check times out, the host is immediately checked. 
With a default value of 20 for max_concurrent_checks and a typical 
plugin timeout of 10 seconds, it could take 20 seconds for Nagios to 
register the first non-OK state during a server reboot.  If several 
servers are rebooting at once, Nagios may never get through enough 
checks while the servers are down.

See 
http://nagios.sourceforge.net/docs/2_0/checkscheduling.html#problem_scheduling
for more info.
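
As a rough illustration, these are the main-config knobs involved.  This 
is only a sketch: the values mirror the numbers above and are not 
recommendations.  Note that the 10-second plugin timeout is normally the 
plugin's own -t option, which is separate from the service_check_timeout 
limit that Nagios itself enforces.

```
# nagios.cfg (main configuration file); illustrative values only
# Upper bound on the number of service checks run in parallel
max_concurrent_checks=20
# Seconds before Nagios kills a service check that has not returned
service_check_timeout=10
```

Raising max_concurrent_checks or lowering the timeouts shortens the 
backlog, but cutting the number of checks is usually the better fix.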


> p.s. now i am counting on nagios developers to expand on this topic 
> possibly by giving some real life examples.

What do you mean by real-life examples?

Generically, here's what I've done to make sure I'm promptly alerted 
when things go wrong:
- three facilities with a custom master + slave setup in which each 
slave checks its own facility's private LAN as well as all publicly 
accessible corporate resources (public SMTP, DNS, etc.)
- customized self-promotion/self-demotion for the slaves if they lose 
contact with the master
- direct SMS and fallback email-to-SMS and email-to-email alerting for 
critical hosts and services
- sane configuration settings

The last one makes the most difference.  Because of possible queue 
delays, you can't check everything all of the time.  Individual 
services are what's critical, not the hosts or everything they run.  If 
you have a "critical" machine that serves up a webapp, run check_http 
every minute, but there's no need to do the same for check_ssh or 
check_ntp.
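
A minimal sketch of what that looks like in the object config.  The host 
name webapp01 and the 15-minute interval are made up for illustration, 
and I'm assuming the generic-service template and the check_http and 
check_ssh commands from the sample configs.  Intervals are in units of 
interval_length, which defaults to 60 seconds.

```
# Hypothetical host name; adjust names and intervals to your own setup.
define service{
        use                     generic-service
        host_name               webapp01
        service_description     HTTP
        check_command           check_http
        normal_check_interval   1       ; critical service: check every minute
        retry_check_interval    1
        }

define service{
        use                     generic-service
        host_name               webapp01
        service_description     SSH
        check_command           check_ssh
        normal_check_interval   15      ; not critical: a slower cadence is enough
        retry_check_interval    5
        }
```

Every check you slow down like this frees up a slot for the ones that 
actually matter.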

-- 
Sean McAfee
System Engineer
