Nagios processes hang

Marantz, Roy Roy.Marantz at deshaw.com
Thu Sep 20 18:06:04 CEST 2007


FYI, I think this is (mostly) fixed by setting the (undocumented?)
external_command_buffer_slots parameter to a large value (I'm using
10000) and by turning OFF aggregate_writes in nsca.cfg.  I think there
may still be a problem in that passive checks can take many minutes to be
reflected in the status display, but at least the server and nsca don't
hang/crash anymore.  If I find out anything else, I'll let the list
know.  Thanks.
Roy
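
For reference, the two settings described above would look roughly like
this. The parameter names and the value 10000 come from the message; the
assumptions are that external_command_buffer_slots goes in nagios.cfg and
that "OFF" is written as 0.

```
# nagios.cfg (assumed location of this parameter)
external_command_buffer_slots=10000

# nsca.cfg
aggregate_writes=0
```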

P.S.  nsca.c looks like it needs some locking to prevent multiple nsca
(sub)processes using aggregated writes from clobbering each other's
messages to the command pipe.  I saw indications of this happening in
the nagios server log.
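
A minimal sketch of the kind of locking meant here, assuming a POSIX
system. This is not code from nsca.c: locked_write(), write_all() and
set_lock() are hypothetical helpers, and the separate lock file that all
writers would share is also an assumption. Each writer takes an exclusive
advisory lock, writes its whole batch to the pipe, then releases the lock,
so batches from different processes should not interleave.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Take (F_WRLCK) or release (F_UNLCK) a whole-file advisory lock,
 * blocking until the request is granted. */
int set_lock(int lock_fd, short type)
{
    struct flock fl = {0};
    int rc;

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 means "to end of file", i.e. the whole file */

    do {
        rc = fcntl(lock_fd, F_SETLKW, &fl);
    } while (rc == -1 && errno == EINTR);
    return rc;
}

/* Hypothetical helper: serialize one batch of commands against other
 * processes that use the same lock file. lock_fd must be open for
 * writing; pipe_fd is the write end of the command pipe. */
int locked_write(int lock_fd, int pipe_fd, const char *buf, size_t len)
{
    int rc;

    if (set_lock(lock_fd, F_WRLCK) != 0)
        return -1;
    rc = write_all(pipe_fd, buf, len);
    if (set_lock(lock_fd, F_UNLCK) != 0)
        return -1;
    return rc;
}
```

Each nsca (sub)process would open the same lock file once and call
locked_write() in place of a bare write(). The lock is advisory, so it
only helps if every writer to the pipe uses it.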

-----Original Message-----
From: Andreas Ericsson [mailto:ae at op5.se] 
Sent: Sunday, September 16, 2007 6:46 PM
To: Marantz, Roy
Cc: 'nagios-users at lists.sourceforge.net'
Subject: Re: [Nagios-users] Nagios processes hang

Marantz, Roy wrote:
> I'm running Nagios 2.8 with around 1400 hosts and around 14000 services
> defined.  I have about 700 active services and the rest come in via nsca.
> 
> My problem has a few symptoms:
> 1) I collect defunct Nagios processes, around 300 per day
> 2) the command pipe stops getting read so nsca is dumping data to its
> dump file
> 3) active service checks have very long (hours) latency
> 
> These all sound like the same problem to me, but I don't know how to
> diagnose it.  Any help would be appreciated.  I have run nagios -s and
> it doesn't suggest anything.  I'm using check_fping for host checks and
> my remaining active service checks.  Attached is the output from nagios
> -v and my nagios.cfg.  Thanks in advance for any help.


The trouble is the FIFO, which holds a maximum of 4096 bytes by default,
meaning it quickly becomes a bottleneck. Nagios tries to empty it as soon
as there's data available on it, but fails to keep up with the data-spam
from nsca.
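
The capacity varies between kernels, so it may be worth measuring it on
the server in question. A minimal sketch, assuming a POSIX system;
pipe_capacity() is a hypothetical helper, not part of Nagios or nsca. It
fills an anonymous pipe with non-blocking writes, on the assumption that
a named FIFO on the same kernel has the same capacity.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Fill a fresh pipe one byte at a time and return how many bytes it
 * accepted before write() stopped succeeding, or -1 on setup error. */
long pipe_capacity(void)
{
    int fds[2];
    int flags;
    char byte = 'x';
    long total = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking write end: a full pipe yields EAGAIN, not a hang. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (;;) {
        ssize_t n = write(fds[1], &byte, 1);
        if (n == 1) {
            total++;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        break; /* pipe is full (EAGAIN) or an unexpected error occurred */
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

Printing the return value of pipe_capacity() on the Nagios host gives a
number to compare against the size of a typical burst from nsca.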

You could try re-nicing the nagios process, which might make it capable
of staying ahead of nsca.

Otherwise you could try modifying the FIFO size and recompiling the
kernel.

Alternatively, patch nagios and nsca to use a unix socket and use
setsockopt() to up the read/write buffer on that socket to 256 KiB.
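
A minimal sketch of the setsockopt() part, assuming a POSIX sockets API.
request_buffers(), granted_buffer() and WANTED_BUF are hypothetical names,
and none of this is taken from an actual Nagios or nsca patch. The kernel
may clamp or adjust the requested size, so the sketch reads the value back
instead of assuming 256 KiB was granted.

```c
#include <sys/types.h>
#include <sys/socket.h>

#define WANTED_BUF (256 * 1024) /* 256 KiB, as suggested above */

/* Ask for `bytes` of send and receive buffer on `sock`.
 * Returns 0 if both requests were accepted, -1 otherwise. */
int request_buffers(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) != 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) != 0)
        return -1;
    return 0;
}

/* Return the buffer size the kernel reports for SO_SNDBUF or SO_RCVBUF,
 * or -1 on error. This can differ from the size that was requested. */
int granted_buffer(int sock, int optname)
{
    int value = 0;
    socklen_t len = sizeof value;

    if (getsockopt(sock, SOL_SOCKET, optname, &value, &len) != 0)
        return -1;
    return value;
}
```

On an AF_UNIX socket, call request_buffers(sock, WANTED_BUF) right after
socket() or accept(), then check granted_buffer(sock, SO_SNDBUF). On Linux
the unix(7) man page indicates that only SO_SNDBUF has an effect for UNIX
domain sockets, and system-wide limits may cap the result.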

The fourth, and possibly tricksiest alternative, is to rewrite nsca as a
neb-module, have it run in a separate thread and update nagios' status
data directly. This last method will scale best but is by far the most
difficult.

Good luck

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
