Problems with FreeBSD and Nagios

Andreas Ericsson ae at op5.se
Thu Dec 14 10:26:04 CET 2006


Jonathan Call wrote:
> I scanned the mailing list trying to find a solution for this. I found a
> brief discussion where someone had the same problem, but nothing was
> really said about what might be wrong.
> 
> My system: 
> Dual 2.8GHz P4 processors
> 4GB of RAM
> FreeBSD 6.1-RELEASE-p10
> 
> Running processes:
> Nagios 2.6 (installed from ports without embedded perl or nanosleep)
> One mysqld process for the nagiosweb utility
> A few NSCA daemon processes for passive checking
> A backup tool daemon
> Apache+modssl (latest from ports)
> Basic FreeBSD services (sshd, sendmail, etc.)
> 
> Problem:
> Random service and host check control processes will lock up and 'spin'
> on the CPU. This is really bad when a host check does it because it
> brings all checks to a halt. It doesn't seem to even notice that all
> checks have gone stale.
> 
> It will look like this in top:
> 
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> 94068 nagios      1 116    0  7500K  6748K CPU2   0 727:37 30.15% nagios
> 94082 nagios      1 116    0  7500K  6748K CPU2   0 734:28 32.55% nagios
> 94104 nagios      1 116    0  7500K  6748K CPU2   0 845:21 37.42% nagios
> 75338 nagios      5  20    0  7500K  6776K kserel 0  91:33  0.00% nagios
> 
> In this example the main nagios pid is 75338. The hung service and/or
> host processes are the other ones.
> 
> The service checks are almost entirely custom scripts, but the host
> check is a standard check_ping that comes with the nagios program.
> 
> Any ideas on how to figure out which service or host check is hung? Or
> how to deal with this problem at all?
> 

A host or service check stuck in an infinite loop wouldn't show up as a 
Nagios process spinning on the CPU. The Nagios processes that execute 
checks just sit around and wait for the plugin to finish, or for 60 
seconds to pass (in the default config), after which they kill it off.
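
As a rough sketch of that behaviour (this is not the actual Nagios code, 
which is C; the function name run_check and the CHECK_TIMEOUT constant 
are made up for illustration), the waiting process blocks without using 
CPU until the plugin exits or the timeout passes:

```python
import subprocess

CHECK_TIMEOUT = 60  # seconds; the default timeout mentioned above


def run_check(command, timeout=CHECK_TIMEOUT):
    """Run a plugin command, blocking until it exits or the timeout passes.

    Returns (exit_code, output). exit_code is None if the plugin had to
    be killed because it ran past the timeout.
    """
    try:
        # The parent sleeps while waiting here; it does not spin on the CPU.
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        return result.returncode, result.stdout.strip()
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before raising this.
        return None, "(check timed out)"
```

A plugin that loops forever would burn CPU under its own name (the 
script or check_ping), and would be gone after the timeout. That is not 
what the top output shows.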

You've found a bug in Nagios which most likely was either introduced in 
the port of it, or is a result of library differences between FreeBSD 
and Linux.

I wouldn't be all too surprised if it turns out that the FreeBSD pthread 
implementation disallows something that the Linux version allows. Note 
that this doesn't necessarily have to be a bug in FreeBSD; Nagios doesn't 
use the pthread API in a way that is explicitly stated as safe, but the 
pthread implementations on Linux and most other unices are forgiving 
enough to make it work anyway.

It's also possible that this bug only triggers on dual-CPU systems with 
a particular library installed, as some kinds of timing problems and 
race conditions just don't happen on single-CPU systems.

What happens if you do

$ gdb --pid=PID    # PID of one spinning nagios process, e.g. 94068 above
(gdb) bt

?
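
If several processes are spinning, something like the following could 
pick out the PIDs to attach to. It is only a sketch: the helper name 
spinning_pids and the 10% threshold are my own assumptions, and it 
expects the text printed by "ps -axo pid,pcpu,comm" with the header 
line included.

```python
def spinning_pids(ps_output, command="nagios", min_cpu=10.0):
    """Return PIDs of `command` processes using at least min_cpu percent CPU.

    ps_output is the text printed by `ps -axo pid,pcpu,comm`,
    including its header line.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # blank or malformed line
        pid, pcpu, comm = fields
        if comm.strip() == command and float(pcpu) >= min_cpu:
            pids.append(int(pid))
    return pids


# Made-up sample, loosely based on the top listing quoted above.
sample = """  PID %CPU COMMAND
94068 30.1 nagios
94082 32.5 nagios
75338  0.0 nagios
  612  0.0 sshd
"""

for pid in spinning_pids(sample):
    print(f"gdb --pid={pid}")
```

Each printed line is a gdb invocation for one spinning process; run 
"bt" at the (gdb) prompt in each and send the output.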

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
