Master and slave servers for Nagios

Wheeler, JF (Jonathan) J.F.Wheeler at rl.ac.uk
Wed Apr 25 10:23:54 CEST 2007


As I have reported in the past, I have 2 slave servers and a master
server; all checks should be run from the slave servers and passed back
to the master server.  I have recently been trying to understand why
the master server still has kernel "Out of memory" problems such that
the kernel starts killing active processes and, in some cases, panics
because there are no more processes to kill (this happens perhaps once
or twice per week, usually around 4:50 - 5:10 in the morning).  As part
of my investigations I have noticed that for a typical host 40% of tests
are reported from the slave and 60% are run by the master.  I can tell
this because 40% of messages for this typical host in /var/log/nagios on
the master server begin "EXTERNAL COMMAND:" and 60% of messages begin
"Warning:".  My question is: why should this be?  Here is a copy of
nagios.log from the master server for one test of one host for today (so
far):

[1177369200] CURRENT SERVICE STATE: csflnx119;SPACE_TMP;OK;HARD;1;DISK
OK - free space: /tmp 672 MB (70% inode=99%):
[1177369894] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 41 seconds (threshold=1817 seconds).  I'm
forcing an immediate check of the service.
[1177370925] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 672 MB (70% inode=99%):
[1177373014] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 43 seconds (threshold=2052 seconds).  I'm
forcing an immediate check of the service.
[1177374874] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 43 seconds (threshold=1816 seconds).  I'm
forcing an immediate check of the service.
[1177376734] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 41 seconds (threshold=1817 seconds).  I'm
forcing an immediate check of the service.
[1177377158] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 672 MB (70% inode=99%):
[1177379494] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 33 seconds (threshold=2305 seconds).  I'm
forcing an immediate check of the service.
[1177381354] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 39 seconds (threshold=1818 seconds).  I'm
forcing an immediate check of the service.
[1177383214] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 43 seconds (threshold=1816 seconds).  I'm
forcing an immediate check of the service.
[1177387073] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177389102] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 13 seconds (threshold=5089 seconds).  I'm
forcing an immediate check of the service.
[1177390507] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177392635] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 11 seconds (threshold=2118 seconds).  I'm
forcing an immediate check of the service.
[1177394495] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 39 seconds (threshold=1818 seconds).  I'm
forcing an immediate check of the service.
[1177396362] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 36 seconds (threshold=1823 seconds).  I'm
forcing an immediate check of the service.
[1177397210] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177399813] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 47 seconds (threshold=2562 seconds).  I'm
forcing an immediate check of the service.
[1177401674] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 40 seconds (threshold=1818 seconds).  I'm
forcing an immediate check of the service.
[1177403749] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 28 seconds (threshold=1931 seconds).  I'm
forcing an immediate check of the service.
[1177404093] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177406037] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 42 seconds (threshold=1902 seconds).  I'm
forcing an immediate check of the service.
[1177410112] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 184 seconds (threshold=2853 seconds).  I'm
forcing an immediate check of the service.
[1177410863] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177413485] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 30 seconds (threshold=2579 seconds).  I'm
forcing an immediate check of the service.
[1177415948] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 40 seconds (threshold=2119 seconds).  I'm
forcing an immediate check of the service.
[1177417738] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177420390] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 29 seconds (threshold=2631 seconds).  I'm
forcing an immediate check of the service.
[1177423551] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 14 seconds (threshold=2481 seconds).  I'm
forcing an immediate check of the service.
[1177424385] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space:
/tmp 660 MB (68% inode=99%):
[1177426431] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 56 seconds (threshold=1990 seconds).  I'm
forcing an immediate check of the service.
[1177428291] Warning: The results of service 'SPACE_TMP' on host
'csflnx119' are stale by 43 seconds (threshold=1816 seconds).  I'm
forcing an immediate check of the service.

The nagios.log file on the slave server only contains the "CURRENT
SERVICE STATE:" entries for this server and test combination.  Why would
this be?  Is it because the slave server is configured to
"obsess_over_services"?  There are a few entries in the nagios.log file
for this host, but they refer only to Warnings (there were no critical
problems on this host).

I have compared the retention data file entries for this service and
they are not significantly different.  I have also run nagios -s
/etc/nagios/nagios.cfg on the master and the slave servers; the output
on both systems says "I have no suggestions - things look okay".  Does
the list have any suggestions?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




