[Nagios-devel] Re: Perceived problem with host checks

SyBase sybase at vantageltd.com
Sun Sep 15 09:40:41 CEST 2002


This may be by design, but IMHO this logic is flawed. Say a problem with 
a particular service on a host first causes the box to reboot (setting 
the host state to down), and then, when the box comes back up, the 
service fails to start. Your monitoring tool will continue to report the 
host as down when that is not the case. If accuracy is the goal (which I 
would think it is), then maybe this should be changed?
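
To make the scenario concrete, here is a minimal Python sketch of the check
sequence as Russell describes it in the quoted message below. It is a toy
model, not Nagios code: the Monitor class, run_cycle, and the two callables
standing in for the service plugin and the host check plugin are all
hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Toy model of the described check sequence (not real Nagios code)."""
    host_up: bool = True       # state the monitor reports for the host
    service_up: bool = True    # state the monitor reports for the service
    max_check_attempts: int = 3

    def run_cycle(self, service_ok, host_ok):
        """Run one service check interval.

        service_ok and host_ok are callables standing in for the service
        plugin and the host check plugin (assumed names).
        """
        if service_ok():
            # A passing service check marks both service and host as up.
            self.service_up = True
            self.host_up = True
            return
        self.service_up = False
        if self.host_up:
            # The host is checked (with retries) only when a service fails
            # on a host that is currently believed to be up.
            self.host_up = any(host_ok() for _ in range(self.max_check_attempts))
        # If the host is already marked down, it is not rechecked here:
        # a failing service is taken to mean the host is still down.


monitor = Monitor()

monitor.run_cycle(lambda: False, lambda: False)   # box is rebooting
after_reboot = (monitor.host_up, monitor.service_up)

monitor.run_cycle(lambda: False, lambda: True)    # box is back, service did not start
after_boot = (monitor.host_up, monitor.service_up)

monitor.run_cycle(lambda: True, lambda: True)     # service finally started
after_fix = (monitor.host_up, monitor.service_up)

print(after_reboot, after_boot, after_fix)
```

In the second cycle the host would answer a ping, yet the model still reports
it as down, because under the described sequence only a passing service check
clears the host state.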


Russell Scibetti wrote:

> We were confused by this at first too, but believe it or not, the 
> behavior you saw is what is expected.  You said that when you 
> eventually turned HTTP back on, both the host and service came back 
> up.  The way the nagios logic works is:
>
> 1.  check the service - if it fails...
> 2.  check the host - if it fails (incl. the retries)...
> 3.  host and service are now in a Hard non-OK state
> 4.  Wait the service's normal_check_interval
> 5.  Run the service (NOT the host) check
> 6.  If the service is still down, then the host must still be down.
> 7.  Wait the service's check interval.....repeat endlessly
>
> The Nagios logic appears to be "well, if the host is down, we can tell 
> it's back up when any of the services are running again" - similar 
> logic to "if a service is running, the host must be fine."  Also, 
> you'll see there is no check_interval for hosts.  This is because it 
> uses the service checks as the basis for the monitoring logic.
>
> This is at least what we can tell.  If someone knows something else, 
> please share.
>
> -Russell Scibetti
>
> John Fox wrote:
>
>> Hello,
>>
>> I'm configuring a Nagios 1.0b4 installation.  It's the first time I've
>> used this product, and I've run into somewhat of a stumbling block.
>>
>> Both hosts used in my tests are running FreeBSD 4.6-STABLE and nagios
>> is installed via the ports system.
>>
>> That said, here are the details:
>>
>> I've configured nagios to do host checks for host A and service
>> checks for HTTPD on A.
>>
>> I start HTTPD on host A and fire up nagios (in daemon mode) on host B.
>>
>> Everything is fine.  Host and service are both marked as UP.
>>
>> I use ipfw to disable ICMP on host A. This is done with the intent of
>> provoking a host check, knowing that the host-check test makes use of
>> ping.
>>
>> Host continues to remain marked as up.  This makes sense to me, given
>> that HTTPD is still running and accessible there.
>>
>> I kill HTTPD on A. 
>> Both host and service become marked as 'down' and I begin to
>> receive problem notifications.
>>
>> I enable ICMP on A, knowing that the check-host-alive command
>> makes use of the 'check_ping' plugin, and expecting that host A will
>> soon be marked as 'UP'.
>>
>> But that does not happen; the host continues to be marked as down.  I
>> watch the various status screens and see multiple host tests
>> performed. I receive multiple problem notifications.
>>
>> I'm flummoxed by this, and log in to host B (the nagios machine) and 
>> verify that I can ping A from there.  I can.  I then run
>> check-host-alive's "check_ping" plugin from the command line.  It
>> instantly returns with a "PING OK" response. (Note: I used the exact
>> same command structure as nagios would -- I took it from the
>> 'check-host-alive' definition found in 'checkcommands.cfg'.)
>>
>> Yet the 'Host Information' page shows the Status info as
>> "Critical -- Plugin timed out after 10 seconds".
>>
>> So to all appearances, nagios and I are getting different results
>> from the exact same command line.  I don't believe this is what's
>> really going on, because it seems absurd to me.  So I go to the FAQ.
>>
>> I see a question that seems to apply: "Hosts are incorrectly listed
>> as being DOWN or UNREACHABLE".  But after reading it, I'm not sure
>> that it does apply.
>>
>> The way I read it, nagios didn't perform any host checks on A until
>> A's HTTPD went down.  Makes sense.
>>
>> At which point a host check is performed -- if the host check doesn't
>> return 'OK', it is run again and again until it has made 
>> max_check_attempts (from the host definition) attempts OR received
>> an 'OK' response.
>>
>> My max_check_attempts is set to 3.  But in observing the various
>> status screens, I saw the "Last Status Check" value changing every 3
>> minutes.  In the course of this test, I allowed the downtime to reach
>> 46 minutes, which to me indicates that 15 host checks were run.
>> Obviously, this is a much larger number than 3.  And certainly it
>> seems that the plugin never received an 'OK' response.  This is quite
>> a conundrum to me!
>>
>> I then restarted HTTPD on host A.  Within three minutes, this service
>> was once again marked as 'UP' and the host, too, was again marked as
>> 'UP', with the 'Host Information' page's "Status Information" field
>> reading "PING OK...".
>>
>> On the off chance that my IPFW/ping machinations were somehow causing
>> weirdness, I repeated the same basic experiment, but rather than
>> disabling ICMP, I ifconfig'd my network card down.  And rather than
>> re-enabling ICMP, I ifconfig'd the interface back up.  This resulted
>> in the same behavior as the previous test.
>>
>> I don't see this as a major issue, given that a successful service
>> check causes the host to be again considered 'UP'.  But it troubles me
>> to not understand the behavior I'm seeing, as I'm simply unable to
>> account for it.
>>
>> Any advice or thoughts would be very much welcomed!
>>
>>
>> Thanks in advance,
>>
>>
>> John
>>
>
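
On the 15-versus-3 puzzle in John's message, a quick check of the arithmetic
may help. It assumes, per Russell's description, that the rechecks seen every
3 minutes are service checks at the normal check interval and not host check
retries, so max_check_attempts does not cap them. The variable names are
illustrative only.

```python
downtime_minutes = 46          # downtime John allowed to accumulate
check_interval_minutes = 3     # observed spacing of "Last Status Check"
max_check_attempts = 3         # from John's host definition

# Whole check intervals that fit inside the downtime window.
rechecks = downtime_minutes // check_interval_minutes

print(rechecks)                       # 15
print(rechecks > max_check_attempts)  # True
```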



