heads-up: trap avalanche depletes swap and leads to killing Nag, Apache, named, ...

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Wed May 14 16:06:27 CEST 2003


Dear Ladies and Gentlemen,

This site's Nag shares a host with snmptrapd, bind, apache and the usual
suspects.

Nag is an ePN build (Nagios with embedded Perl) that can use up to half
of the 256 MB of RAM (it is usually restarted each month).

This evening a load balancer fired off ~120 traps in 20 minutes after
an iPlanet directory server (apparently) started 'running out of file
descriptors' and therefore repeatedly binding to and unbinding from the
listening socket.

tsitc> tail -800 nagios.log | grep -i process_ser | ./ns_log_localtime | head
Wed May 14 19:13:09 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;2;Failed. SLB cannot reach port 389 on real server (server failure) castor (10.0.100.11).
 ...
Wed May 14 19:32:29 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;0;Ok. SLB can reach port 389 on real server castor (10.0.100.11).
Wed May 14 19:33:09 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;2;Failed. SLB cannot reach port 389 on real server (server failure) castor (10.0.100.11).
Wed May 14 19:33:09 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;0;Ok. SLB can reach port 389 on real server castor (10.0.100.11).
Wed May 14 19:33:09 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;2;Failed. SLB cannot reach port 389 on real server (server failure) castor (10.0.100.11).
Wed May 14 19:33:13 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;0;Ok. SLB can reach port 389 on real server castor (10.0.100.11).
tsitc> 

snmptrapd is configured to run a /bin/sh script that interprets the trap
and injects the process_service_check_result command into the command
queue.
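
For readers unfamiliar with passive checks, here is a minimal sketch of
what such a handler has to produce. It is written in Python rather than
the /bin/sh actually used on this host, and the function names and the
command file path argument are hypothetical. Only the line format
([timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output) is
taken from the Nagios external command interface.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    # Collapse embedded newlines so the command stays on a single line.
    clean_output = " ".join(output.splitlines())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{clean_output}\n")


def submit(command_file, line):
    """Hypothetical helper: write the line to the external command file.

    command_file is site-specific (assumption: it is passed in by the
    caller). If it is a named pipe, open() blocks until Nagios has it
    open for reading.
    """
    with open(command_file, "w") as fifo:
        fifo.write(line)
```

Each trap costs one handler process plus one such line, so the handler
itself is cheap; the expense comes from whatever Nagios launches in
response to the resulting state changes.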

On this occasion, however, the swap apparently became overcommitted, as
the kernel log shows:

May 14 19:30:18 tsitc /kernel: swap_pager: out of swap space
May 14 19:30:18 tsitc /kernel: swap_pager_getswapspace: failed
May 14 19:30:18 tsitc /kernel: pid 143 (httpd), uid 0, was killed: out of swap space
May 14 19:30:18 tsitc /kernel: pid 58178 (httpd), uid 80, was killed: out of swap space
May 14 19:32:12 tsitc /kernel: swap_pager_getswapspace: failed
May 14 19:32:41 tsitc /kernel: pid 81284 (nagios), uid 1000, was killed: out of swap space
May 14 19:33:09 tsitc /kernel: swap_pager_getswapspace: failed
May 14 19:33:11 tsitc last message repeated 112 times
May 14 19:33:11 tsitc /kernel: pid 78804 (nagios), uid 1000, was killed: out of swap space
May 14 19:33:11 tsitc last message repeated 2 times
May 14 19:33:13 tsitc /kernel: pid 78074 (nagios), uid 1000, was killed: out of swap space
May 14 19:33:15 tsitc /kernel: pid 54997 (nagios), uid 1000, was killed: out of swap space
May 14 19:33:15 tsitc /kernel: pid 91002 (nagios), uid 1000, was killed: out of swap space
May 14 19:42:12 tsitc /kernel: pid 75 (named), uid 53, was killed: out of swap space

and I eventually realised that things were strangely quiet.

The collateral damage included 2 or 3 copies of a shell listening on
port 162. Perhaps these were forked copies of snmptrapd caught before
the execve had completed. These processes had to be killed manually
before snmptrapd could be restarted (and bind to that port again).

Unfortunately, although I was using the host at the time, I failed to
notice the impact until I became aware of 120 Perl processes waiting on
the SMS lockfile.

Obviously this was the cause of the memory overcommit: even though they
were all asleep, they still occupied 5 MB of memory each. That is
roughly 600 MB for 120 processes, more than twice the host's 256 MB of
RAM.

There seems to be a need for me to rethink my notification tactics, but
in any case, a high rate or large number of service criticals is going
to make life hard for the Nag host.
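
One possible tactic, offered only as a sketch and not something
already in use here: have each notifier try the lock without blocking,
and if it is busy, append its message to a spool file and exit at once
instead of sleeping on the lock. Whoever holds the lock sends the backlog
along with its own message. The function name, the spool file and the
send callback are all hypothetical, and the example is in Python rather
than the Perl used for the real SMS notifier.

```python
import fcntl
import os


def notify_or_spool(message, lock_path, spool_path, send):
    """Send now if the lock is free; otherwise spool and return at once."""
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of sleeping.
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Another notifier holds the lock: leave the text for it.
            with open(spool_path, "a") as spool:
                spool.write(message + "\n")
            return False
        # We hold the lock: pick up anything spooled by earlier callers.
        backlog = []
        draining = spool_path + ".draining"
        try:
            os.replace(spool_path, draining)
        except FileNotFoundError:
            pass
        else:
            with open(draining) as spool:
                backlog = [ln.rstrip("\n") for ln in spool if ln.strip()]
            os.remove(draining)
        send(backlog + [message])  # one delivery for the whole batch
        return True
    finally:
        os.close(lock_fd)  # closing the descriptor releases the lock
```

With this shape a burst of 120 criticals leaves at most one notifier
resident at a time. The sketch is not race-free: a message spooled just
after the holder has drained the spool waits for the next notification,
so a real version would need a periodic flush as well.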


Yours sincerely,





-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.

