Nagios 3.1.2 gives (Return code of 127 is out of bounds - plugin may be missing) when MACRO is too long

Jeffrey Lensen jeffrey at hyves.nl
Fri Mar 5 14:41:25 CET 2010


Hey,

Yesterday I added a servicegroup (called NRPEchecks) which contains 23.660
host-service entries to our configuration. After restarting the Nagios
server, our Central Nagios server started complaining when it had to perform
(forced) NRPE checks and/or send notifications. Everything responded with
"(Return code of 127 is out of bounds - plugin may be missing)".
Oddly enough, this only seemed to happen with NRPE checks (and notification
commands) which call upon a (bash) script. NRPE checks which do nothing more
than '/usr/bin/nrpe --help | grep "Version"' were performed without any
problems what so ever. SNMP check scripts and TCP checks (which run on the
Nagios hosts) worked perfectly fine aswel.
Checks performed by the distributed servers we're still being processed
without issues by the Central Nagios server.
Performing the NRPE check command by hand as user Nagios works perfectly
fine. So the file exists and is executable for Nagios.
When I remove the servicegroup from the configs, everything runs smoothly
again..

Some background information of our distributed setup:
- Nagios Version                          3.1.2
- Central servers:                         1
- Distributed servers:                    8
- Total Hosts:                              2.798
- Total Services:                          49.579
- Service groups:                         10
- NRPEchecks servicegroup:        23.660 entries

Sizes of the objects.cache file on the Central Nagios server:
- Lines:                           2.083.738
- Words:                         4.079.205
- Characters:                   51.916.032
- Size:                            50M


Here is the some logfile output:
Mar  5 10:11:05 src at nagioshost nagios: Warning: Return code of 127 for check
of service 'Disk' on host 'testhost' was out of bounds. Make sure the plugin
you're trying to run actually exists.
Mar  5 10:21:07 src at nagioshost nagios: Warning: Return code of 127 for check
of service 'Disk' on host 'testhost' was out of bounds. Make sure the plugin
you're trying to run actually exists.
Mar  5 10:11:05 src at nagioshost nagios: SERVICE
ALERT: testhost;Disk;CRITICAL;SOFT;1;(Return code of 127 is out of bounds -
plugin may be missing)
Mar  5 10:21:07 src at nagioshost nagios: SERVICE
ALERT: testhost;Disk;CRITICAL;SOFT;2;(Return code of 127 is out of bounds -
plugin may be missing)


The relevant parts of the configuration look like this:
- NRPE.cfg:
command[check_disk]=sh /usr/nagios/scripts/check_disk.sh

- Nagios cfgs:
define host{
    use                       generic-host
    host_name             testhost
    alias                      testhost
    address                 testhost.server
}

define service{
    use                             Disks-check
    hostgroup_name          testhosts
    service_description      Disks
}

define service{
    use                           generic-service
    name                        Disks-check
    check_command        check_nrpe_disk
    servicegroups            NRPEchecks
    contact_groups          admins
    register                      0
}

define servicegroup{
    servicegroup_name    NRPEchecks
    alias                         NRPEchecks
}

define command{
    command_name      check_nrpe_disk
    command_line         $USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c
check_disk
}

- resource.cfg:
$USER1$=/usr/lib/nagios/plugins


When forcing a check on the Central Nagios server the Nagios debugging
output showed this:
[1267785907.716681] [008.0] [pid=24083] ** Timed Event ** Type: 0, Run Time:
Fri Mar  5 11:37:31 2010
[1267785907.716686] [008.0] [pid=24083] ** Service Check Event ==> Host:
'testhost', Service: 'Disk', Options: 1, Latency: 456.716000 sec
[1267785907.716697] [001.0] [pid=24083] run_scheduled_service_check() start
[1267785907.716701] [016.0] [pid=24083] Attempting to run scheduled check of
service 'Disk' on host 'testhost': check options=1, latency=456.716000
[1267785907.716708] [001.0] [pid=24083] run_async_service_check()
[1267785907.716712] [001.0] [pid=24083] check_service_check_viability()
[1267785907.716716] [016.0] [pid=24083] Checking service 'Disk' on host
'testhost'...
[1267785907.716732] [001.0] [pid=24083] get_raw_command_line()
[1267785907.716738] [2320.2] [pid=24083] Raw Command Input:
$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk
[1267785907.716744] [2320.2] [pid=24083] Expanded Command Output:
$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk
[1267785907.716748] [001.0] [pid=24083] process_macros()
[1267785907.716753] [2048.1] [pid=24083] **** BEGIN MACRO PROCESSING
***********
[1267785907.716756] [2048.1] [pid=24083] Processing: '$USER1$/check_nrpe -t
30 -H $HOSTADDRESS$ -c check_disk'
[1267785907.716760] [2048.2] [pid=24083]   Processing part: ''
[1267785907.716766] [2048.2] [pid=24083]   Not currently in macro.  Running
output (0): ''
[1267785907.716785] [2048.2] [pid=24083]   Processing part: 'USER1'
[1267785907.716794] [2048.2] [pid=24083]   Processed 'USER1', Clean Options:
0, Free: 0
[1267785907.716798] [2048.2] [pid=24083]   Processed 'USER1', Clean Options:
0, Free: 0
[1267785907.716802] [2048.2] [pid=24083]   Cleaning options: global=0,
local=0, effective=0
[1267785907.716806] [2048.2] [pid=24083]   Uncleaned macro.  Running output
(23): '/usr/lib/nagios/plugins'
[1267785907.716810] [2048.2] [pid=24083]   Just finished macro.  Running
output (23): '/usr/lib/nagios/plugins'
[1267785907.716814] [2048.2] [pid=24083]   Processing part: '/check_nrpe -t
30 -H '
[1267785907.716818] [2048.2] [pid=24083]   Not currently in macro.  Running
output (44): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H '
[1267785907.716821] [2048.2] [pid=24083]   Processing part: 'HOSTADDRESS'
[1267785907.716825] [2048.2] [pid=24083]   macro_x[2] (HOSTADDRESS) match.
[1267785907.716831] [2048.2] [pid=24083]   Processed 'HOSTADDRESS', Clean
Options: 0, Free: 1
[1267785907.716834] [2048.2] [pid=24083]   Processed 'HOSTADDRESS', Clean
Options: 0, Free: 1
[1267785907.716838] [2048.2] [pid=24083]   Cleaning options: global=0,
local=0, effective=0
[1267785907.716843] [2048.2] [pid=24083]   Uncleaned macro.  Running output
(64): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server'
[1267785907.716847] [2048.2] [pid=24083]   Just finished macro.  Running
output (64): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server'
[1267785907.716851] [2048.2] [pid=24083]   Processing part: ' -c check_disk'
[1267785907.716856] [2048.2] [pid=24083]   Not currently in macro.  Running
output (91): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server
-c check_disk'
[1267785907.716860] [2048.1] [pid=24083]   Done.  Final output:
'/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server -c check_disk'
[1267785907.716863] [2048.1] [pid=24083] **** END MACRO PROCESSING
*************
[1267785907.716896] [016.1] [pid=24083] Check result output will be written
to '/var/nagios/spool/checkresults/checkbmXvnd' (fd=8)
[1267785907.722094] [016.2] [pid=24083] Service check is executing in child
process (pid=26053)
[1267785907.722949] [001.0] [pid=26053] process_macros()
[1267785907.722988] [001.0] [pid=26053] process_macros()
[1267785907.722996] [001.0] [pid=26053] process_macros()
[1267785907.723003] [001.0] [pid=26053] process_macros()
[1267785907.723010] [001.0] [pid=26053] process_macros()
[1267785907.723017] [001.0] [pid=26053] process_macros()
[1267785907.738768] [001.0] [pid=26053] process_macros()
[1267785907.738803] [001.0] [pid=26053] process_macros()
[1267785907.738817] [001.0] [pid=26053] process_macros()
[1267785907.738827] [001.0] [pid=26053] process_macros()
[1267785907.738855] [001.0] [pid=26053] process_macros()
[1267785907.738865] [001.0] [pid=26053] process_macros()
[1267785912.044580] [016.2] [pid=26056] Moving temp check result file
'/var/nagios/spool/checkresults/checkbmXvnd' to queue file
'/var/nagios/spool/checkresults/cvxmTpo'...
[1267785912.274671] [001.0] [pid=24083] handle_timed_event() end
[1267785912.274709] [008.1] [pid=24083] ** Event Check Loop
[1267785912.274743] [008.1] [pid=24083] Next High Priority Event Time: Fri
Mar  5 11:45:08 2010
[1267785912.274753] [008.1] [pid=24083] Next Low Priority Event Time:  Fri
Mar  5 11:42:29 2010
[1267785912.274757] [008.1] [pid=24083] Current/Max Service Checks: 1/0
[1267785912.274763] [001.0] [pid=24083] handle_timed_event() start
[1267785912.274771] [008.0] [pid=24083] ** Timed Event ** Type: 5, Run Time:
Fri Mar  5 11:45:08 2010
[1267785912.274776] [008.0] [pid=24083] ** Check Result Reaper
[1267785912.274780] [001.0] [pid=24083] reap_check_results() start
[1267785912.274783] [016.0] [pid=24083] Starting to reap check results.
[1267785912.274807] [016.1] [pid=24083] Starting to read check result queue
'/var/nagios/spool/checkresults'...
[1267785912.274836] [016.1] [pid=24083] Processing check result file:
'/var/nagios/spool/checkresults/cvxmTpo'
[1267785912.274961] [016.2] [pid=24083] Found a check result (#1) to
handle...
[1267785912.274978] [016.1] [pid=24083] Handling check result for service
'Disk' on host 'testhost'...
[1267785912.274983] [001.0] [pid=24083] handle_async_service_check_result()
[1267785912.274986] [016.0] [pid=24083] ** Handling check result for service
'Disk' on host 'testhost'...
[1267785912.274990] [016.1] [pid=24083] HOST: testhost, SERVICE: Disk, CHECK
TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes,
RETURN CODE: 127, OUTPUT: (null)
[1267785912.275056] [016.2] [pid=24083] ST: SOFT  CA: 3  MA: 3  CS: 2  LS: 2
 LHS: 0
[1267785912.275062] [016.2] [pid=24083] Service had a HARD STATE CHANGE!!
[1267785912.275066] [016.1] [pid=24083] Service is in a non-OK state!
[1267785912.275070] [016.1] [pid=24083] Host is currently UP, so we'll
recheck its state to make sure...
[1267785912.275074] [016.1] [pid=24083] * Using last known host state: 0
[1267785912.275078] [016.1] [pid=24083] Current/Max Attempt(s): 3/3
[1267785912.275082] [016.1] [pid=24083] Service has reached max number of
rechecks, so we'll handle the error...
[1267785912.275093] [001.0] [pid=24083] process_macros()
[1267785912.275097] [2048.1] [pid=24083] **** BEGIN MACRO PROCESSING
***********
[1267785912.275101] [2048.1] [pid=24083] Processing: 'SERVICE
ALERT: testhost;Disk;$SERVICESTATE$;$SERVICESTATETYPE$;$SERVICEATTEMPT$;(Return
code of 127 is out of bounds - plugin may be missing)
'
[1267785912.275105] [2048.2] [pid=24083]   Processing part: 'SERVICE
ALERT: testhost;Disk;'
[1267785912.275109] [2048.2] [pid=24083]   Not currently in macro.  Running
output (40): 'SERVICE ALERT: testhost;Disk;'
[1267785912.275114] [2048.2] [pid=24083]   Processing part: 'SERVICESTATE'
[1267785912.275118] [2048.2] [pid=24083]   macro_x[4] (SERVICESTATE) match.
[1267785912.275124] [2048.2] [pid=24083]   Processed 'SERVICESTATE', Clean
Options: 0, Free: 1
[1267785912.275128] [2048.2] [pid=24083]   Processed 'SERVICESTATE', Clean
Options: 0, Free: 1
[1267785912.275131] [2048.2] [pid=24083]   Cleaning options: global=0,
local=0, effective=0
[1267785912.275136] [2048.2] [pid=24083]   Uncleaned macro.  Running output
(48): 'SERVICE ALERT: testhost;Disk;CRITICAL'
[1267785912.275140] [2048.2] [pid=24083]   Just finished macro.  Running
output (48): 'SERVICE ALERT: testhost;Disk;CRITICAL'
[1267785912.275144] [2048.2] [pid=24083]   Processing part: ';'
[1267785912.275148] [2048.2] [pid=24083]   Not currently in macro.  Running
output (49): 'SERVICE ALERT: testhost;Disk;CRITICAL;'
[1267785912.275152] [2048.2] [pid=24083]   Processing part:
'SERVICESTATETYPE'
[1267785912.275156] [2048.2] [pid=24083]   macro_x[42] (SERVICESTATETYPE)
match.
[1267785912.275160] [2048.2] [pid=24083]   Processed 'SERVICESTATETYPE',
Clean Options: 0, Free: 1
[1267785912.275165] [2048.2] [pid=24083]   Processed 'SERVICESTATETYPE',
Clean Options: 0, Free: 1
[1267785912.275168] [2048.2] [pid=24083]   Cleaning options: global=0,
local=0, effective=0
[1267785912.275172] [2048.2] [pid=24083]   Uncleaned macro.  Running output
(53): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD'
[1267785912.275176] [2048.2] [pid=24083]   Just finished macro.  Running
output (53): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD'
[1267785912.275180] [2048.2] [pid=24083]   Processing part: ';'
[1267785912.275183] [2048.2] [pid=24083]   Not currently in macro.  Running
output (54): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;'
[1267785912.275187] [2048.2] [pid=24083]   Processing part: 'SERVICEATTEMPT'
[1267785912.275191] [2048.2] [pid=24083]   macro_x[6] (SERVICEATTEMPT)
match.
[1267785912.275196] [2048.2] [pid=24083]   Processed 'SERVICEATTEMPT', Clean
Options: 0, Free: 1
[1267785912.275200] [2048.2] [pid=24083]   Processed 'SERVICEATTEMPT', Clean
Options: 0, Free: 1
[1267785912.275204] [2048.2] [pid=24083]   Cleaning options: global=0,
local=0, effective=0
[1267785912.275208] [2048.2] [pid=24083]   Uncleaned macro.  Running output
(55): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3'
[1267785912.275221] [2048.2] [pid=24083]   Just finished macro.  Running
output (55): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3'
[1267785912.275226] [2048.2] [pid=24083]   Processing part: ';(Return code
of 127 is out of bounds - plugin may be missing)
'
[1267785912.275230] [2048.2] [pid=24083]   Not currently in macro.  Running
output (118): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3;(Return code of
127 is out of bounds - plugin may be missing)
'
[1267785912.275234] [2048.1] [pid=24083]   Done.  Final output: 'SERVICE
ALERT: testhost;Disk;CRITICAL;HARD;3;(Return code of 127 is out of bounds -
plugin may be missing)
'
[1267785912.275237] [2048.1] [pid=24083] **** END MACRO PROCESSING
*************


We also ran an stace on the Nagios process while forcing a check:
[pid 16879] clone(Process 16880 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7fd49a633780) = 16880
[pid 16880] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE],
SA_RESTORER|SA_RESTART, 0x7fd499ef03c0},  <unfinished ...>
[pid 16880] <... rt_sigaction resumed> NULL, 8) = 0
[pid 16880] close(5 <unfinished ...>
[pid 16880] <... close resumed> )       = 0
[pid 16880] dup2(8, 1 <unfinished ...>
[pid 16880] <... dup2 resumed> )        = 1
[pid 16880] close(8)                    = 0
[pid 16880] execve("/bin/sh", ["sh", "-c",
"/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server -c
check_disk"], [/* 197 vars */]) = -1 E2BIG (Argument list
too long)
[pid 16880] exit_group(127)             = ?
Process 16880 detached
[pid 16879] wait4(16880, [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0,
NULL) = 16880


The problem seems to be related/similar to this one:
https://sourceforge.net/mailarchive/message.php?msg_id=1234329173.3569.70.camel@localhost.localdomain

My guess is that the "Argument list too long" is
the $SERVICEGROUPMEMBERS$ macro...


Cheers,

Jeffrey Lensen
Hyves │ System Engineering
jeffrey at hyves.nl │ skyler.hyves.nl │ www.hyves.nl
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/users/attachments/20100305/6dea28df/attachment.html>
-------------- next part --------------
------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
-------------- next part --------------
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null


More information about the Users mailing list