Dependency processing during network outage causing eventual server hang.

Robert Arends rarends at imc.net.au
Mon Aug 20 15:58:15 CEST 2007


Hi all,

I have a weird problem that I'd like to share, and hopefully someone has
some insight.

Sorry for the lengthy dump below, but I find that a detailed question
makes the initial email exchange more efficient.

Environment:
FC2 / Nagios 2.6

Nagios reports the following stats (from the Tactical Overview; times
are min / max / avg):

Service Check Execution Time: 0.00 / 1.92 / 0.505 sec
Service Check Latency:        0.00 / 3.29 / 0.432 sec
Host Check Execution Time:    1.01 / 40.16 / 3.049 sec
Host Check Latency:           0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks:  316 / 753
# Passive Host / Service Checks:   0 / 185

Most of the passive service checks are SNMP traps submitted via SNMPTT/SEC.

The 316 hosts (each with at least 2 services: check_ping and check_snmp)
are broken into 6 dependency trees.
Each tree has the following number of hosts: Ahl 161, Hbs 45, Ebs 14,
Imc 55, Ntl 35, Nut 6.
These represent customers' hosts, and each customer's network is
accessed via a single link.
Each dependency tree starts off with a 'root' host (our end of the link
to the customer) and a single dependent host (the next hop).
After that the dependencies follow the routed path to each host.
All great so far.
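
To make the layout concrete, here is a minimal Python sketch of one tree
as a child-to-parent map, with a helper that walks from a leaf up to the
root. The host names and the depth are hypothetical (not our real
hosts); only the shape (root, single next hop, then the routed path)
comes from the description above.

```python
# Hypothetical host names: each host maps to its parent (the next hop
# back towards the monitoring server). None marks the tree's root.
PARENTS = {
    "ntl-root": None,            # our end of the link to the customer
    "ntl-nexthop": "ntl-root",   # the single dependent host (next hop)
    "ntl-core": "ntl-nexthop",
    "ntl-edge1": "ntl-core",
    "ntl-leaf1": "ntl-edge1",
}


def path_to_root(host, parents=PARENTS):
    """Return the chain of hosts from `host` up to the root of its tree."""
    path = []
    seen = set()
    while host is not None:
        if host in seen:
            # A loop would make the walk endless, so fail loudly instead.
            raise ValueError(f"dependency loop at {host}")
        seen.add(host)
        path.append(host)
        host = parents[host]
    return path


print(" -> ".join(path_to_root("ntl-leaf1")))
```

In this made-up tree the leaf sits 5 levels deep, which is the upper end
of the 3 to 5 levels we see in practice.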

The problem is that when the link to a customer fails, the behaviour
we have experienced repeatedly is the ultimate death of the server due
to a high process count and exhausted RAM.  The server has 2 GB RAM and
uses only about 1 GB in normal operation.

The chronology of events is thus:
1. The link fails.
2. A leaf host's service is reported as SOFT down.
3. The host is checked until 'max_check_attempts' is reached.
4. Then, before the host is reported in the log as HARD down, the parent
host in the dependency hierarchy is checked.
5. This repeats until the path is traced up to the "network outage"
root, 3 to 5 levels.
6. This process then seems to repeat for each and every service until
they are ultimately marked as unreachable due to the network outage.
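
A back-of-the-envelope sketch in Python of why this would stall
everything, assuming the checks in steps 3 to 6 really do run one after
another and that every failing service triggers the full walk up the
tree. The 10 seconds per failed check and the 4 levels are assumed
values for illustration, not measurements, and the function name is
made up.

```python
def worst_case_block_seconds(services, levels, max_check_attempts,
                             secs_per_failed_check):
    """Rough upper bound on time spent in serial host checks.

    Assumes each failing service causes `max_check_attempts` checks at
    each of `levels` hosts on the path to the root, all run serially.
    """
    host_checks_per_service = levels * max_check_attempts
    return services * host_checks_per_service * secs_per_failed_check


# Ahl tree: 161 hosts with at least 2 services each -> 322 services.
# levels=4 and 10 s per failed check are assumptions, not measured.
for attempts in (12, 2):
    secs = worst_case_block_seconds(322, 4, attempts, 10)
    print(f"max_check_attempts={attempts}: {secs / 3600:.1f} hours")
```

Under those assumptions the largest tree alone accounts for roughly 43
hours of serial checking with 12 attempts and about 7 hours with 2. That
is only an upper bound, but it is far longer than the outages we see.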

All the while this is occurring, the "Scheduling Queue" does not move.
The server's process list shows a single Nagios process.
What seems to happen is that the whole Nagios system becomes single
threaded and fixated on checking all services one elongated step at a
time.
Not even the hosts in *other* dependency trees are being processed.
The nagios.log shows SNMP traps entering via the passive command
interface, but in the GUI the "alert history" does not show them.

We've had 'max_check_attempts' set to 12 and found that the above
scenario ultimately (after 15-40 minutes) produces this message in
nagios.log:
"Warning: A system time change of 1116 seconds (forwards in time) has
been detected.  Compensating..."
This is followed by many of these:
"Warning: The check of service 'NTL-PING' on host 'ntl*****' looks like
it was orphaned (results never came back).  I'm scheduling an immediate
check of the service..."
At this point the number of processes exceeds 600 and it is just a
matter of time (about 30 mins) before the only option is to power off
the server.
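
For anyone wanting to look for the same pattern in their own log, here
is a small Python sketch that counts the two warnings quoted above. The
regular expressions are built from those quoted messages; the sample
lines, their timestamps and the host name 'ntl-leaf1' are made up for
illustration.

```python
import re

# Patterns taken from the two warnings quoted above.
TIME_JUMP = re.compile(r"A system time change of (\d+) seconds")
ORPHANED = re.compile(
    r"The check of service '([^']+)' on host '([^']+)' "
    r"looks like it was orphaned"
)


def summarize(lines):
    """Collect time-change sizes and count orphaned-check warnings."""
    jumps = []
    orphaned = 0
    for line in lines:
        match = TIME_JUMP.search(line)
        if match:
            jumps.append(int(match.group(1)))
        elif ORPHANED.search(line):
            orphaned += 1
    return {"time_jumps": jumps, "orphaned": orphaned}


# Illustrative lines only: timestamps and host name are invented.
sample = [
    "[1187618295] Warning: A system time change of 1116 seconds "
    "(forwards in time) has been detected.  Compensating...",
    "[1187618296] Warning: The check of service 'NTL-PING' on host "
    "'ntl-leaf1' looks like it was orphaned (results never came back).  "
    "I'm scheduling an immediate check of the service...",
]
print(summarize(sample))
```

To run it against a real log, pass an open file object to summarize();
the location of nagios.log depends on the install.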

This has happened 5 or 6 times before.  Today we tested it in a
controlled environment and reproduced it easily.
Next we reduced 'max_check_attempts' to 2 and found the chronology is
the same.  This time the time adjustment message had still not appeared
after 50 minutes, but we did see more of the service/host/parent checks
mentioned above.

As soon as the link was re-established, all the "Scheduling Queue" tasks
were released and normal operation resumed (provided the server hadn't
died first).

Has anyone seen this sort of thing before?
I've looked at the changelogs for 2.7/2.8/2.9 to see if there are fixes
for this sort of thing, but no luck.

Rob :-) 

________________________________

Robert Arends, Systems Engineer.
Direct 03 9863 1334 * Mobile 0412 412 345 * Email rarends at imc.net.au
Web www.imc.net.au * Helpdesk 1300 555 IMC * Managed Services 02 9006
8282 (24hrs)	 	
________________________________

