Child host becomes UNREACHABLE when parent changes from UP to a SOFT DOWN state

Aidan Anderson mail at aidananderson.co.uk
Wed Apr 7 12:16:25 CEST 2010


Hi List!

I am in the process of upgrading Nagios from v2.12 to v3.2.1 and am
taking the opportunity to move to a new server at the same time.  This
has allowed me to run both versions in tandem and compare their
behaviour.

One difference I noticed straight away was the downtime duration on
certain hosts.  For example, v2 would show a host as down for over 2
days, yet v3 would show the same host as down for only a few hours.  On
investigation, it turned out that the host's parent on v3 had gone into
a soft down state, which changed the child host to an unreachable
state.  The parent recovered within a minute or so and the child changed
back to a down state, effectively resetting its down duration to zero.
I would have expected the child host to change state only if the parent
goes into a hard down state, not a soft down state.
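
The sequence above can be illustrated with a toy model.  This is only a
sketch of the effect, not Nagios code: the HostTracker class, the
classify_child function and the require_hard_parent switch are
hypothetical names, the timeline is made up to match the description,
and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class HostTracker:
    """Toy model of a child host whose own check keeps failing."""
    state: str = "UP"
    last_change: int = 0

    def set_state(self, new_state: str, now: int) -> None:
        # Duration is measured from the last state change,
        # so every flip between states resets it.
        if new_state != self.state:
            self.state = new_state
            self.last_change = now

    def duration(self, now: int) -> int:
        return now - self.last_change


def classify_child(parent_state: str, parent_state_type: str,
                   require_hard_parent: bool) -> str:
    """A failing child is UNREACHABLE if its parent counts as down, else DOWN."""
    parent_down = parent_state == "DOWN" and (
        parent_state_type == "HARD" or not require_hard_parent
    )
    return "UNREACHABLE" if parent_down else "DOWN"


def replay(require_hard_parent: bool) -> int:
    """Return the child's reported problem duration one hour after the blip."""
    two_days = 2 * 24 * 3600
    child = HostTracker()
    # (time, parent state, parent state type) -- illustrative values only
    timeline = [
        (0, "UP", "HARD"),              # child goes down, parent is fine
        (two_days, "DOWN", "SOFT"),     # parent blips into a soft down state
        (two_days + 60, "UP", "HARD"),  # parent recovers a minute later
    ]
    for now, p_state, p_type in timeline:
        child.set_state(classify_child(p_state, p_type, require_hard_parent), now)
    return child.duration(two_days + 3600)


print(replay(require_hard_parent=False))  # soft parent state affects the child
print(replay(require_hard_parent=True))   # only a hard parent state affects the child
```

With require_hard_parent=False the child's duration restarts when the
parent recovers; with require_hard_parent=True the child stays DOWN
throughout and keeps its original start time.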

I googled for the issue and found one related post from just over a year 
ago:

http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg25543.html

The poster was given various suggestions to work around the problem,
e.g. tweaking flap detection or increasing the plugin time-out, but
nothing that seemed to resolve his issue.

The poster's main problem with this behaviour was that the state changes
caused down e-mail alerts for hosts that were already down.  My issue is
not with repeated alerts but with the accuracy of the down duration of
the host.  When our support department looks to resolve host problems,
they try to resolve the oldest problems first, for obvious reasons of
fairness to our customers.  Resetting the duration breaks that ordering.
In v3, to get an accurate downtime for a host, you now have to trawl
through the alert history or run a trend report for the host to find out
when it really went down.
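
As a rough illustration of what trawling the alert history amounts to,
here is a minimal sketch that scans log lines for the start of a host's
current outage, treating DOWN and UNREACHABLE as one continuous problem.
It assumes host alert lines of the form
"[epoch] HOST ALERT: host;STATE;STATE_TYPE;attempt;output"; the function
name outage_start, the host names and the sample lines are made up for
illustration.

```python
import re

# Assumed shape of a host alert line in the log:
# [epoch] HOST ALERT: host;STATE;STATE_TYPE;attempt;plugin output
ALERT_RE = re.compile(r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);")


def outage_start(log_lines, host):
    """Return the epoch time the host's current outage began, or None if it is UP.

    Any non-UP alert (soft or hard, DOWN or UNREACHABLE) counts as part of
    the same outage; only an UP alert ends it.
    """
    start = None
    for line in log_lines:
        match = ALERT_RE.match(line)
        if not match or match.group(2) != host:
            continue
        timestamp, state = int(match.group(1)), match.group(3)
        if state == "UP":
            start = None          # outage over, forget its start
        elif start is None:
            start = timestamp     # first problem alert of a new outage
    return start


# Invented sample lines, oldest first
sample = [
    "[1270400000] HOST ALERT: child1;DOWN;SOFT;1;CRITICAL - no response",
    "[1270400060] HOST ALERT: child1;DOWN;HARD;3;CRITICAL - no response",
    "[1270600000] HOST ALERT: child1;UNREACHABLE;HARD;1;CRITICAL - no response",
    "[1270600060] HOST ALERT: child1;DOWN;HARD;1;CRITICAL - no response",
    "[1270600100] HOST ALERT: parent1;UP;HARD;1;PING OK",
]
print(outage_start(sample, "child1"))
```

For the sample lines this reports the time of the first DOWN alert
rather than the time of the last DOWN/UNREACHABLE flip.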

Version 2 does not exhibit this problem.  I don't think this is by
design; it is purely down to the way serial host checks work in v2.
When a host goes into a soft down state in v2, Nagios cannot do anything
else until it has completed all the retries or the host recovers.
Nagios therefore never gets the chance to mark the child host
unreachable unless the parent reaches max_check_attempts and Nagios
determines that it really is down.

The original poster of this problem made a good point: Nagios has all
the tolerance built in to avoid false alarms on host checks, but
unfortunately this logic doesn't carry through to child hosts.

I can't see the current way v3 deals with parent/child problems being
desirable for most people, although it seems to have bothered only 2 of
us!

Thoughts?

regards,
Aidan

