Dependency processing during network outage causing eventual server hang.

Robert Arends rarends at imc.net.au
Tue Aug 21 01:22:34 CEST 2007


Marc, 

Thanks for your answer.
We are indeed using the PARENTS directive, apologies for using the wrong
word - I had no idea until reading further last night that there are
also dependency directives.  From my perspective the PARENTS directive
produced a dependent hierarchy, so I called it 'dependency'. We also
came from using WhatsUp Gold, and there the parents concept is called
dependency.

The other RAM user is MRTG, but RAM is not the issue. RAM use does not
climb except when 600 processes start at once!

:Marc said:
Host check processing is a serial process. Nagios 2 and prior stops
_all_ other processing while hosts are being checked up to
max_check_attempts
::

Wow, ok so we can tune this to reduce the impact of the problem.

Now that you know we are using parents logic, can you revisit your
answer? Especially re point 6 below.

> 6. then this process seems to repeat for each and every service until
> they are _all_ ultimately marked as unreachable due to the network
> outage.

Based on the above experience (point 6)...
Even with max_check_attempts set to 2, that would be ...
(161 leaf-hosts x ~2 services) + ~100 parents = 422 checks x 2 minutes
= 844 minutes (about 14 hours) to fully check the entire outage and
return to checking the other parent trees.
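
The estimate above can be reproduced with a short Python sketch. The counts
and the 2 minutes per item are the rough figures quoted above; the function
name and the assumption that every item is checked strictly one after
another are illustrative only, not a description of Nagios internals.

```python
def serial_outage_minutes(leaf_hosts, services_per_host, parents, minutes_per_item):
    """Estimate the time to work through an outage when every item
    (each leaf-host service plus each parent host) is checked serially.

    Returns a tuple: (number of items, total minutes).
    """
    items = leaf_hosts * services_per_host + parents
    return items, items * minutes_per_item


# Figures from the message: 161 leaf hosts, ~2 services each,
# ~100 parents, ~2 minutes per item with max_check_attempts = 2.
items, total_minutes = serial_outage_minutes(161, 2, 100, 2)
print(items, total_minutes)          # 422 844
print(round(total_minutes / 60, 1))  # 14.1 (hours)
```

Because the total scales linearly with the minutes spent per item, halving
the time each check takes would halve the whole estimate.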

I think you are saying that Nagios 3 continues to check other hosts
while dealing with a network outage - how stable is v3?

Rob :-) 


-----Original Message-----
From: nagios-users-bounces at lists.sourceforge.net
[mailto:nagios-users-bounces at lists.sourceforge.net] On Behalf Of Marc
Powell
Sent: Tuesday, August 21, 2007 12:38 AM
To: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] Dependency processing during network
outage causing eventual server hang.

Preface - I don't use dependencies, but...

> -----Original Message-----
> From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
> bounces at lists.sourceforge.net] On Behalf Of Robert Arends
> Sent: Monday, August 20, 2007 8:58 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] Dependency processing during network outage
> causing eventual server hang.
> 


> Each dependency starts off with a 'root' host (our end of the link to
> the customer) and a single dependant host (the next hop).
> After that the dependencies follow the routed path to each host.
> All great so far.

I'd suggest that using the 'parents' directive in your host definitions
is probably a better way to accomplish the above rather than
dependencies. Your use appears to be exactly what it was meant for.
 
> The problem is that when the link to the customer fails, the behaviour
> we have experienced repeatedly is the ultimate death of the server due
> to high process and low RAM.  The server has 2 GB RAM and uses only
> about 1GB in normal operation.

What's using the extra RAM?
 
> The chronology of events is thus:
> 1. link fails
> 2. a leaf host's service is reported as SOFT down.
> 3. The host is checked until 'max_check_attempts' are reached.
> 4. then before the host is reported in the log as HARD down, the
> parent host in the dependency hierarchy is checked.
> 5. this repeats until the path is traced up to the "network outage"
> root,  3 to 5 levels.
> 6. then this process seems to repeat for each and every service until
> they are ultimately marked as unreachable due to the network outage.

1-5 appear normal. 6 is probably because you're using dependencies but I
would expect nagios to use the last host check state instead of
re-checking. I'm a bit more familiar with the parents logic though...
 
> All the while this is occurring, the "Scheduling Queue" does not move.
> The server processes show a single Nagios process.
> What seems to happen is that the whole Nagios system has become single
> threaded and fixated on checking all services one elongated step at a
> time.

Yup. That's well known behavior. Host check processing is a serial
process. Nagios 2 and prior stops _all_ other processing while hosts are
being checked up to max_check_attempts. You want your host checks to
complete as quickly as possible: only the minimum number of pings (if
that's what you use), repeated the minimum number of check attempts
needed to satisfy you that the host is really down.

> As soon as the link was re-established, all the "Scheduling Queue"
> tasks released and normal operation resumed (provided the server
> didn't die first).

A likely explanation is that host check results return an OK state and
nagios moves on to the next task immediately.
 
--
Marc


------------------------------------------------------------------------
-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when
reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null

