Problems with nagios

Wheeler, JF (Jonathan) J.F.Wheeler at rl.ac.uk
Fri Mar 14 11:52:00 CET 2008


In the past I have reported problems when our master server has failed
with "Out of memory" errors caused by all server memory and swap space
being used up.  I have largely (but not completely) solved these by
increasing the number of "Command" and "Check result" buffers.  However,
I would like some explanations of the following problems (note that I
run 1 master and 5 slave servers, shortly to become 6 slaves; the
master server runs the nagios, nsca and ndo2db daemons):

1. When I arrived this morning, there were 27000+ nsca processes waiting
to run.  Counting the number of processes showed that the number was
increasing by at least 10 per second.

2. Recently, a restart of the nagios daemon (on the master server) hung
after 27 seconds and never reached completion.

3. For some restarts of the nagios daemon (for example, after a
configuration change), the command pipe cannot be created because there
is a regular file in its place.  Is this regular file created by an nsca
process?  Can I stop this happening?

4. After a reboot of the master server to try to fix problems 1 and 2
above (I had already tried restarting nsca and nagios, and killing many
of the nsca processes), the nagios daemon did not update any of its log
files.  See the following output from the command "nagiostats -c
/etc/nagios/nagios.cfg":

Nagios Stats 2.10
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 10-21-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File:                          /var/log/nagios/tmpfs/status.dat
Status File Age:                      0d 0h 56m 56s
Status File Version:                  2.10

Program Running Time:                 0d 0h 57m 34s
Nagios PID:                           3229
Used/High/Total Command Buffers:      0 / 0 / 40960
Used/High/Total Check Result Buffers: 0 / 0 / 61440

Total Services:                       18688
Services Checked:                     18688
Services Scheduled:                   26
Active Service Checks:                4882
Passive Service Checks:               13806
Total Service State Change:           0.000 / 94.540 / 0.082 %
Active Service Latency:               0.207 / 94495564.236 / 19643.884 sec
Active Service Execution Time:        0.116 / 31.104 / 0.612 sec
Active Service State Change:          0.000 / 94.540 / 0.105 %
Active Services Last 1/5/15/60 min:   0 / 0 / 0 / 0
Passive Service State Change:         0.000 / 76.250 / 0.074 %
Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:            17257 / 210 / 174 / 1047
Services Flapping:                    0
Services In Downtime:                 0

Total Hosts:                          907
Hosts Checked:                        901
Hosts Scheduled:                      0
Active Host Checks:                   907
Passive Host Checks:                  0
Total Host State Change:              0.000 / 20.000 / 0.162 %
Active Host Latency:                  0.000 / 235.096 / 4.491 sec
Active Host Execution Time:           0.000 / 10.127 / 0.358 sec
Active Host State Change:             0.000 / 20.000 / 0.162 %
Active Hosts Last 1/5/15/60 min:      0 / 0 / 0 / 0
Passive Host State Change:            0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                859 / 48 / 0
Hosts Flapping:                       0
Hosts In Downtime:                    0

Output from command "nagios -s /etc/nagios/nagios.cfg":


Nagios 2.10
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 10-21-2007
License: GPL

Warning: Host 'Dont know 1 on 184' has no services associated with it!
Warning: Host 'Dont know 2 on 184' has no services associated with it!
Warning: Host 'babarams1' has no services associated with it!
Warning: Host 'babarams1-2' has no services associated with it!
Warning: Host 'babarams1-3' has no services associated with it!
Warning: Host 'babarams1-4' has no services associated with it!
Warning: Host 'babarams2' has no services associated with it!
Warning: Host 'babarams2-2' has no services associated with it!
Warning: Host 'babarams2-3' has no services associated with it!
Warning: Host 'babarams2-4' has no services associated with it!
Warning: Host 'c2certdb' has no services associated with it!
Warning: Host 'c2certdlf' has no services associated with it!
Warning: Host 'c2certlsf' has no services associated with it!
Warning: Host 'c2certns' has no services associated with it!
Warning: Host 'c2certstager' has no services associated with it!
Warning: Host 'ctsc18' has no services associated with it!
Warning: Host 'jra1dch01' has no services associated with it!
Warning: Host 'jra1dcp01' has no services associated with it!
Warning: Host 'swt-4400-1' has no services associated with it!
Warning: Host 'swt-5510-1' has no services associated with it!
Warning: Host 'swt-5510-2' has no services associated with it!
Warning: Host 'swt-5510-3' has no services associated with it!
Warning: Host 'swt-5530-0' has no services associated with it!
Warning: Host 'swt-55xx-ads' has no services associated with it!
Warning: Host 'swt001' has no services associated with it!
Warning: Host 'swt002' has no services associated with it!
Warning: Host 'swt003' has no services associated with it!
Warning: Host 'swt004' has no services associated with it!
Warning: Host 'swt005' has no services associated with it!
Warning: Host 'swt006' has no services associated with it!
Warning: Host 'swt007' has no services associated with it!
Warning: Host 'swt008' has no services associated with it!
Warning: Host 'swt010' has no services associated with it!
Warning: Contact 'guyDaytime' is not a member of any contact groups!
Warning: Contact group 'aix-ads-contacts-callout' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-build' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-preprod' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-srmV2' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'corew' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'tape-robot-contacts-callout' is not used in any host/service definitions or host/service escalations!
Projected scheduling information for host and service
checks is listed below.  This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     907
Total scheduled hosts:           0
Host inter-check delay method:   SMART
Average host check interval:     0.00 sec
Host inter-check delay:          0.00 sec
Max host check spread:           30 min
First scheduled check:           N/A
Last scheduled check:            N/A


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     18688
Total scheduled services:           21
Service inter-check delay method:   SMART
Average service check interval:     11742.86 sec
Inter-check delay:                  85.71 sec
Interleave factor method:           SMART
Average services per host:          20.60
Service interleave factor:          1
Max service check spread:           30 min
First scheduled check:              Wed Mar 12 10:10:11 2008
Last scheduled check:               Thu Mar 13 04:00:00 2008


CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval:      4 sec
Max concurrent service checks:      Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.

Output from command "cd /var/log/nagios; ls -ltr . rw tmpfs":

rw:
total 0
prw-rw----  1 nagios apache 0 Mar 12 09:41 nagios.cmd

.:
total 59264
-rw-rw-r--  1 nagios nagios     2483 Mar  5 08:39 downtime.log
drwxr-xr-x  2 nagios nagios    12288 Mar 12 00:00 archives
-rw-------  1 nagios nagios 22832729 Mar 12 08:41 retention.dat
-rw-r--r--  1 nagios nagios 15081485 Mar 12 08:42 objects.cache
drwxr-sr-x  2 nagios apache     4096 Mar 12 08:42 rw
-rw-rw-r--  1 nagios nagios    96471 Mar 12 08:42 comment.log
drwxrwxrwt  2 root   root         60 Mar 12 08:42 tmpfs
-rw-rw-r--  1 nagios nagios 22564667 Mar 12 08:42 nagios.log

tmpfs/:
total 20864
-rw-r--r--  1 nagios nagios 21333927 Mar 12 08:42 status.dat
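
For problem 3, the listing above shows rw/nagios.cmd as a named pipe
(the leading "p" in "prw-rw----").  When a restart fails, the same check
can be scripted to tell whether a regular file has taken its place.
The following is only a minimal Python sketch: the function name
classify_path is hypothetical and not part of Nagios, and it says
nothing about which process created the file.

```python
import os
import stat


def classify_path(path):
    """Return 'fifo', 'regular', 'missing' or 'other' for the given path."""
    try:
        # lstat() so that a symlink is reported as 'other', not followed.
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"      # what the nagios command pipe should be
    if stat.S_ISREG(mode):
        return "regular"   # the failure case described in problem 3
    return "other"
```

On the master server this would be called as
classify_path("/var/log/nagios/rw/nagios.cmd"), the path implied by the
"ls" command above.  A result of "regular" just before a restart would
match the symptom described in problem 3.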

Any comments, advice, etc. would be most appreciated, as it is getting
rather frustrating when nagios does not perform reliably.
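
For anyone wanting to reproduce the measurement in problem 1, the
growth in nsca processes can be estimated by counting them twice and
dividing the difference by the interval.  The following is a rough
Python sketch, not the method actually used above.  It assumes a Linux
/proc filesystem that provides /proc/<pid>/comm (newer kernels only),
and the function names and the 10 second interval are illustrative
choices.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count processes whose command name (/proc/<pid>/comm) equals name."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited (or comm is unavailable); skip it.
            continue
    return count


def growth_per_second(first, second, interval):
    """Average change in process count per second between two samples."""
    return (second - first) / interval


def sample_growth(name, interval=10.0, proc_root="/proc"):
    """Take two counts, interval seconds apart, and return the rate."""
    before = count_processes(name, proc_root)
    time.sleep(interval)
    after = count_processes(name, proc_root)
    return growth_per_second(before, after, interval)
```

Calling sample_growth("nsca") on the master server would return the
average number of nsca processes gained (or lost, if negative) per
second over the sampling interval.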

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
