Problem with some NSCA packets getting corrupted on 64-bit SLES 10

Frost, Mark {PBG} mark.frost1 at pepsi.com
Tue Jan 22 17:05:55 CET 2008


Brian,

You beat me to the punch.  After a few days of trying to figure out the
pattern, I found this was only happening when the distributed nodes were
trying to do host checks.  Further digging revealed that we were using
'fping' to check host reachability, and its output did include a ','.
The "send" shell script I was using at the time passed -d ',' to
send_nsca to use as a delimiter.

So while the actual host check was sending only 3 fields to the
send_service_check script, the -d ',' argument to send_nsca was causing
the line to be broken into 4 fields, so the NSCA daemon assumed it was a
service check.
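
The field-count mix-up described above can be sketched in a few lines of
Python.  This is only a rough model, not the actual send_nsca source: it
assumes the line is split on the first three delimiters, that 3 fields mean
a host check and 4 a service check, and that the return code is converted
C atoi-style (non-numeric text becomes 0).  The function names and the
sample fping text are hypothetical; only the trailing ' rta=0.140000 ms)'
fragment comes from the debug output in this thread.

```python
import re


def atoi(text):
    """Mimic C's atoi(): parse a leading integer, else return 0 (assumed behaviour)."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


def classify(line, delimiter="\t"):
    """Split one result line into host or service check fields (simplified model)."""
    fields = line.rstrip("\n").split(delimiter, 3)
    if len(fields) == 3:
        host, code, output = fields
        return ("HOST CHECK", host, None, atoi(code), output)
    if len(fields) == 4:
        host, service, code, output = fields
        return ("SERVICE CHECK", host, service, atoi(code), output)
    raise ValueError("expected 3 or 4 fields, got %d" % len(fields))


# Hypothetical fping-style plugin output containing a comma.
output = "FPING OK - HOSTXXX (loss=0%, rta=0.140000 ms)"

# Host check result joined with ',' as in the old "send" script (-d ',').
print(classify(",".join(["HOSTXXX", "0", output]), delimiter=","))

# The same result joined with a tab stays a 3-field host check.
print(classify("\t".join(["HOSTXXX", "0", output])))
```

With ',' as the delimiter the first call comes back as a service check
named '0' whose output is ' rta=0.140000 ms)', which is the shape of the
NSCA debug line quoted further down; with a tab it stays a host check.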

Not that it matters which I use, but I switched over to Ethan's
script.  I guess when no -d argument is passed to send_nsca, it uses a
tab as the delimiter.  Anyway, that part of my migration has been
working fine.
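
If a custom delimiter ever has to be used, one way to guard against this
is to make sure the plugin output cannot contain it before the line is
handed to send_nsca.  A minimal sketch, assuming the 3-field host check
layout (host, return code, output); the helper name and the choice to
replace stray delimiters with a space are illustrative assumptions, not
anything taken from Ethan's script:

```python
def build_host_result(host, return_code, output, delimiter="\t"):
    """Format one host check line, keeping the delimiter out of the output text."""
    if delimiter in host:
        raise ValueError("host name contains the delimiter")
    # Replace stray delimiters and newlines so the field count stays at 3.
    safe_output = output.replace(delimiter, " ").replace("\n", " ")
    return delimiter.join([host, str(return_code), safe_output]) + "\n"


line = build_host_result("HOSTXXX", 0, "FPING OK (loss=0%, rta=0.140000 ms)")
print(repr(line))
```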

Glad to see the whole 64-bit business was a red herring in my setup
(whew!).

Thanks for your help.

Mark

-----Original Message-----
From: nagios-users-bounces at lists.sourceforge.net
[mailto:nagios-users-bounces at lists.sourceforge.net] On Behalf Of Brian
A. Seklecki
Sent: Saturday, January 19, 2008 12:32 PM
To: Frost, Mark {PBG}
Cc: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] Problem with some NSCA packets getting
corrupted on 64-bit SLES 10

MF:

Show us your ocsp_command and ochp_command mappings.  Are you calling a
piped command from checkcommands.cfg or calling an external shell
script?

I guarantee you the comma (",") in results is being mapped into a field
delimiter, which confuses nscad(8).

~~BAS 

On Thu, 2008-01-17 at 10:37 -0500, Frost, Mark {PBG} wrote:
> I've recently begun an effort to move our Nagios installation to a
> distributed architecture from a centralized one.  I had previously used
> NSCA only for a very few passive checks and it works fine on a 32-bit
> Red Hat AS 3 platform (the centralized server).
> 
> In testing on a distributed architecture (which is 64-bit SUSE Linux
> Enterprise Server (SLES) 10), I seem to have a problem with NSCA.  (Note
> that all Nagios and NSCA binaries and libraries were recompiled on the
> 64-bit platform.)
> 
> After I broke out all the checks to have 2 separate distributed nodes
> send to a central server, I saw a few messages like this one in the
> nagios.log file:
> 
> [1200583727] Warning:  Passive check result was received for service '0'
> on host 'HOSTXXX', but the service could not be found!
> 
> but only about 1 out of every 10 or maybe 20 results was doing this.
> That is, the rest of the results were being correctly shown as "EXTERNAL
> COMMAND" and all expected NSCA fields came up correctly (hostname,
> service desc, check result, text output).
> 
> I started having the "send_nsca" script from the distributed nodes log
> what they were sending to a file.  When I correlate what they're sending
> with what the NSCA daemon thinks it's receiving, the client is still
> sending the correct 4 fields, but it's as if the NSCA daemon is dropping
> the 2nd field (service desc) and replacing it with the check result
> field.  So ultimately, it thinks the service name is '0'.
> 
> I can't see that this matches a pattern (i.e. always on the same hosts
> or same service checks).  All I've seen so far is that it happens
> whether I run NSCA as --single or --daemon.  It also happens even if I
> turn off one of the distributed nodes (that is, I can't see it being
> volume related).
> 
> I have turned on debugging in the NSCA daemon to see what it thinks it's
> getting and it echoes what the nagios.log shows:
> 
> SERVICE CHECK -> Host Name: 'HOSTXXX', Service Description: '0', Return
> Code: '0', Output: ' rta=0.140000 ms)'
> 
> Again, maybe only 1 out of 10.  Ultimately, this causes the server to
> run an active check as it thinks it never got a result from the
> distributed node.
> 
> I'm still trying to dig deeper, but it seems to me that this is
> increasingly pointing to some issue with 64-bit SLES.  Or perhaps some
> variable type in NSCA daemon that's not quite right for 64-bit.  It's
> hard to tell with its intermittent nature and the fact that I have yet
> to discover a pattern.
> 
> Has anyone seen anything like this before?
> 
> Thanks
> 
> Mark
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null






