Error 126 and 127 on multiple commands

Morris, Patrick patrick.morris at hp.com
Fri Oct 16 11:25:01 CEST 2009


I've been running Nagios for years, and today I ran into an issue 
that's got me banging my head against a wall.

I've got a distributed setup, basically with two Nagios 3.1.0 machines 
on Red Hat EL4 running the same checks simultaneously. Today they both 
started reporting a return code of 126 or 127 for various commands, even 
though the commands exist and their permissions allow the nagios user to 
run them.

For example, this happens whenever a notification is attempted:

[1255684343] Warning: Attempting to execute the command "/usr/bin/printf 
"%b" "***** Nagios  *****\n\nNotification Type: PROBLEM\nNotification 
Number: 2\n\nService: MYSERVICE\nHost: myhost\nAddress: 
myhost.edited.com\nState: CRITICAL\n\nDate/Time: Fri Oct 16 02:12:22 PDT 
2009\n\nAdditional Info:\n\n(Return code of 127 is out of bounds - 
plugin may be missing)\n\nComment: : \n\nWiki: 
https://wiki.link\n\nNagios: 
https://nagios/nagios/cgi-bin/extinfo.cgi?type=2&host=myhost&service=MYSERVICE" 
| /bin/mail -s "PROBLEM: myhost/MYSERVICE CRITICAL **" noc at mydomain.com" 
resulted in a return code of 127.  Make sure the script or binary you 
are trying to execute actually exists...

If I use "su - nagios" and copy and paste the failed command at a 
command prompt, it works. The notification commands very consistently 
return a 127, while various checks (but not all of them) will return a 
126 or a 127.
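
For reference, these two codes follow the usual shell convention: 127 
means the shell could not find the command, and 126 means it found the 
file but could not execute it. A minimal sketch of how each one arises, 
written in Python for illustration only; the path 
/nonexistent/hypothetical-plugin and the helper name shell_exit_code are 
made up for this example and are not part of Nagios:

```python
import os
import stat
import subprocess
import tempfile


def shell_exit_code(command):
    """Run a command line through /bin/sh and return its exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# 127: the shell cannot find the command at all (hypothetical path).
missing = shell_exit_code("/nonexistent/hypothetical-plugin")

# 126: the file exists, but it has no execute bit, so it cannot be run.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
    handle.write("#!/bin/sh\nexit 0\n")
    script_path = handle.name
os.chmod(script_path, stat.S_IRUSR | stat.S_IWUSR)
not_executable = shell_exit_code(script_path)
os.unlink(script_path)

print("missing command:", missing)
print("non-executable file:", not_executable)
```

This only reproduces the textbook causes. It does not explain why a 
command that exists and is executable would return either code from 
inside the Nagios daemon.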

Stranger, the same exact plugin (check_http, for example) may work fine 
for one service, but return an error code for another.

Now, my installation on this instance of Nagios is pretty large: 548 
hosts and about 8500 services. The same check configurations and 
plugins, however, are synched across 24 other Nagios boxes and assigned 
to different hosts, and those all work just fine. It's just this, my 
biggest installation, where they've started failing.

This feels to me like I've hit some sort of capacity limitation. I've 
pared down some things (like cutting a complicated escalation 
configuration from 24,000 escalations to 3,500), but that didn't help. 
I've offloaded half the checks to another system that submits passive 
results over nsca, but that didn't help either.
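
One way to test the capacity hunch might be to look at the per-process 
resource limits the nagios user actually gets. The sketch below is only 
an assumption about where to look, not a confirmed cause: it prints the 
soft and hard limits for user processes and open files using Python's 
standard resource module. The names LIMITS, describe and current_limits 
are invented for this example. It would need to be run as the nagios 
user, and the limits seen by a daemon started from an init script may 
differ from those of an interactive "su - nagios" shell.

```python
import resource

# Limits that commonly matter when a daemon forks many short-lived checks.
LIMITS = {
    "max user processes (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "max open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
}


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

Comparing this output with the number of processes and open files the 
nagios user holds at peak would show whether either limit is close to 
being reached.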

I've played with a lot of tuning settings like limiting concurrent 
checks, spacing out an aggressively tuned check schedule, and generally 
just screwing with stuff, but nothing's worked, and I'm wondering if 
someone's run into this sort of thing before, and might be able to point 
me at something I haven't tried yet.

For the record, there's no SELinux involved, and nothing unusual in the 
system logs.
