Lots of hosts, only a couple of services?

Andreas Ericsson ae at op5.se
Wed Aug 25 16:42:56 CEST 2004


Jason Byrns wrote:
> Thanks to everyone for their input, I certainly appreciate it.
> 
> To summarize, it sounds like the place to start is to change our service 
> checks from ping to telnet checks.  Or possibly even SNMP or something.
> I am also going to change check_host_alive settings, as it only sends 
> one packet now.  (It was already at five seconds and 100% packet loss 
> for critical status, which still seems fair.)
> 
> (Is there any advantage to checking SNMP instead of telnet?)
> 
> As someone else already mentioned, check_telnet is basically already 
> defined as "check_tcp -H (host address) -p 23".
> 

Be a bit wary of that one. Some admittedly stupid switches and 
routers don't seem to notice the RST when check_tcp drops the 
connection, so (presumably) the dead sessions pile up and you might 
find yourself locked out of your own switch/router. Make sure you try 
it on one you can get your hands on for an Attila-style reboot before 
setting it up to run against your favorite satellite.
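
For trying this out against a lab device first, here is a minimal Python 
sketch of a comparable probe. It is not check_tcp itself; the function 
name and defaults are made up for illustration. It asks for an orderly 
shutdown before closing, but whether a given switch copes any better 
with that is something you would have to test.

```python
import socket


def tcp_port_open(host: str, port: int = 23, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration; not part of Nagios or its plugins.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False
    try:
        # Request an orderly shutdown before closing. The kernel may still
        # send RST if unread data (e.g. a telnet banner) is pending.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # The peer may already have dropped the connection.
    finally:
        sock.close()
    return True
```

Run it in a loop against the test device for a while, then check that 
you can still log in by hand.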

> As for QoS, I'm not sure that's an option.  If one of our wireless 
> access points is too busy to reply, wouldn't the AP itself need some 
> kind of QoS features to help us?  I don't think they do, we've got a 
> mixture of older and a few newer Cisco access points, and those are 
> usually the ones that may miss a check or two here and there...
> 
> As for the max_check_attempts, and how it relates to host and service 
> checks, I believe I found my final answer in the Nagios FAQ pages. 
> However, after searching yesterday I couldn't find it again.  All I 
> could find was this page, which mentions exceptions to the monitoring 
> logic:
> http://nagios.sourceforge.net/docs/1_0/statetypes.html
> 
> ...but says it will not discuss those exceptions for now.
> 
> The information I found before basically stated what I said earlier: 
> when a single service check fails, a host check is triggered.  And if a 
> host check then also fails, it then chooses to skip the "soft" error 
> states and go straight to a "hard" error state.  In other words, ignore 
> the max_check_attempts and send out notifications right away.  And not 
> as a bug, but since, y'know, your HOST is down!  Not just a service!
> 
> But tweaking our host checks is probably the answer to any single false 
> positive warning.  Besides, I'm going to go ahead and slap Nagios onto 
> one of my test servers, and put together a very simple setup to test 
> again how Nagios handles service and host checks and max_check_attempts.
> I'm virtually certain that we were being warned every time, after any 
> host failed just a single check, even though my settings look like it 
> should take five failed checks in a row.
> 

Again, host checks are run serially with no delay between retries, 
meaning a max_check_attempts of 4 would yield up to 40 seconds' worth of 
trying before deciding it's down (assuming you use a check-host-alive 
timeout value of 10 seconds).
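
To spell out the arithmetic, a small Python sketch (the function name is 
made up for illustration). It assumes the retries run back to back and 
that every attempt runs to its full timeout, so the result is an upper 
bound rather than a typical figure.

```python
def worst_case_host_check_seconds(max_check_attempts: int,
                                  timeout_seconds: float) -> float:
    """Upper bound on the time spent retrying a host check.

    Assumes attempts run one after another with no delay in between
    and that each attempt uses its whole timeout.
    """
    return max_check_attempts * timeout_seconds


# The example from above: 4 attempts with a 10 second timeout.
print(worst_case_host_check_seconds(4, 10))  # 40
```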

> Thanks again, everybody!
> 
> -- 
> Jason Byrns
> System Administrator, MicroLnk
> 

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




