Fix for host dependency checks

Holger Weiss holger at CIS.FU-Berlin.DE
Wed Mar 22 01:48:39 CET 2006


* Ethan Galstad <nagios at nagios.org> [2006-03-21 12:50]:
> On 24 Feb 2006 at 19:04, Holger Weiss wrote:
> > * Holger Weiss <holger at CIS.FU-Berlin.DE> [2006-01-30 16:54]:
> > > There is a timing problem in the host[*] dependency check logic: If
> > > host B is configured to be dependent on host A being up and host A
> > > goes down, the dependency will only fail if host A "incidentally"
> > > was checked _prior_ to host B after going down.  Hence, the host
> > > dependency logic will sometimes work and sometimes not.  I'd
> > > therefore suggest explicitly (re-)checking host A during the
> > > dependency checking for host B, as the attached patch does.
> >
> > Okay, this introduces a new problem: If host B is checked immediately
> > before and host A (during the dependency check) after a recovery of
> > both hosts, the dependency won't fail.  Hence, notifications for host
> > B won't be suppressed (been there, got the t-shirt).
> >
> > Next try: The attached patch lets the dependency fail if either the
> > current or the previous (hard) state of A matches the failure
> > criteria. AFAICS, this should reliably suppress notifications for host
> > B if the dependency fails.
>
> I'll keep this on the TODO list for Nagios 3.x, but I think it might
> require some more thought.  The last hard state of the host should
> only be used in the dependency logic if a state change occurred
> relatively recently.  If, for example, the last hard state change
> occurred two days ago, you don't want that value used in the logic.

Okay, but the current Nagios code uses _only_ the last hard state (no
matter how "old" it is), which is why I ran into the problem in the
first place.  I thought about checking the freshness of the last hard
state myself (the information is already available in the host struct,
so this would be easy), but I left that out, since letting the
dependency fail if either the current or the last hard state matches the
criteria seemed sufficiently safe to me.  This way, "false alarms" for
the (dependent) host B should reliably be prevented, while the risk of
suppressing legitimate notifications for B because the dependency fails
due to an outdated last hard state of A is the same as with the current
Nagios code.  I believe that in practice this risk is very low: I
suppose that in almost all cases, the configured dependency criteria
will be a down and/or unreachable state.  So the risk would be that an
outdated down or unreachable state lets the dependency fail, but down
and unreachable states should normally be more or less up-to-date.
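
To make the two variants discussed above concrete, here is a rough
sketch in C.  It is not the attached patch and not the actual Nagios
code: the struct layouts, the function names and the max_age parameter
are all made up for illustration.  dependency_failed() lets the
dependency fail if either the current or the last hard state of A
matches the criteria; dependency_failed_fresh() additionally ignores a
last hard state that changed too long ago, along the lines Ethan
suggested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified types; the real Nagios structs differ. */
enum host_state { HOST_UP, HOST_DOWN, HOST_UNREACHABLE };

struct host {
    enum host_state current_state;
    enum host_state last_hard_state;
    time_t last_hard_state_change;   /* when last_hard_state was set */
};

struct host_dependency {
    const struct host *master;       /* host A, the one depended upon */
    bool fail_on_up;
    bool fail_on_down;
    bool fail_on_unreachable;
};

/* Does the given state match the configured failure criteria? */
static bool matches_criteria(const struct host_dependency *dep,
                             enum host_state state)
{
    switch (state) {
    case HOST_UP:          return dep->fail_on_up;
    case HOST_DOWN:        return dep->fail_on_down;
    case HOST_UNREACHABLE: return dep->fail_on_unreachable;
    }
    return false;
}

/* Fail if either the current or the last hard state matches. */
bool dependency_failed(const struct host_dependency *dep)
{
    assert(dep != NULL && dep->master != NULL);
    return matches_criteria(dep, dep->master->current_state)
        || matches_criteria(dep, dep->master->last_hard_state);
}

/* Same, but only consult the last hard state if it changed no more
 * than max_age seconds before now (assumed freshness rule). */
bool dependency_failed_fresh(const struct host_dependency *dep,
                             time_t now, double max_age)
{
    assert(dep != NULL && dep->master != NULL);
    if (matches_criteria(dep, dep->master->current_state))
        return true;
    if (difftime(now, dep->master->last_hard_state_change) > max_age)
        return false;
    return matches_criteria(dep, dep->master->last_hard_state);
}
```

With criteria "down, unreachable" and host A currently up but with a
last hard state of down, dependency_failed() reports a failure no
matter how old that hard state is, whereas dependency_failed_fresh()
only does so while the hard state change is within max_age.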

In any case, many thanks for looking into this issue!

Holger

-- 
PGP fingerprint:  F1F0 9071 8084 A426 DD59  9839 59D3 F3A1 B8B5 D3DE

