Distributed Monitoring Central Server no status changes

Marc Powell marc at ena.com
Wed Feb 25 20:58:20 CET 2009


Hi Paul,

Please always respond on list so that others now, and in the future,  
can learn from your experience and so that you can benefit from the  
experience of others on the list. More below...

On Feb 25, 2009, at 12:54 PM, Paul Landauer wrote:

> On Wed, 2009-02-25 at 12:06 -0600, Marc Powell wrote:

> I'm using 2 servers following the documentation at
> http://nagios.sourceforge.net/docs/3_0/distributed.html

Thanks.

>> - example host and service definitions from both servers (complete
>> definitions please)
> Definitions are the same on both servers.
> Example host definition:
> define host{
> 	use	generic-host
> 	host_name	surf
> 	alias	Surf Control
> 	address	ip_address_of_surf_is_here
> 	max_check_attempts	5
> 	check_command	check-host-alive
> 	check_interval	5
> 	retry_interval	1
> 	check_period	24x7
> 	contact_groups	admins
> 	notification_interval	30
> 	notification_period	24x7
> 	notification_options	d,u,r
> 	}
>
> Example Service Definitions (surf is a member of  
> sunrise_windows_servers):
> define service{
> 	use			generic-service
> 	hostgroup_name		sunrise_windows_servers
> 	service_description	NSClient++ Version
> 	check_command		check_nt!CLIENTVERSION
> 	}

For future reference, these are not 'complete' since you use
templates. The templates contain a lot of important information that
is needed when troubleshooting. I expect that the definitions do
differ between the servers once the templates are taken into account;
otherwise your central server would be doing active checks of the
services in addition to receiving the passive checks, overwriting
their results. (I don't think this is the problem.)

>> - related nagios.log information from both servers
> I included excerpts that I thought applied.  If you'd like the whole
> log, let me know.
> Nagios.log for Distributed server:
> [1235575724] SERVICE ALERT: surf;Explorer;CRITICAL;HARD;3;Explorer.exe: not running
> [1235575724] SERVICE NOTIFICATION: nagiosadmin;surf;Explorer;CRITICAL;notify-service-by-email;Explorer.exe: not running
>
> Nagios.log for Central Server:
> [1235575777] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;surf;Explorer;0;Explorer.exe: not running
> [1235575778] PASSIVE SERVICE CHECK: surf;Explorer;0;Explorer.exe: not running

This is interesting and useful. As you can see, on your distributed
server the state is CRITICAL (numeric code 2; the 3 in that log line
is the check attempt number), but by the time NSCA writes it to the
command pipe on the central server, something in the process has
turned it into 0 (OK). This could be because nagios isn't passing the
correct status code to your submission script, because your submission
script isn't interpreting it or passing it to send_nsca correctly, or
because nsca on the receiving side isn't reading it correctly.
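
To keep an eye on what the central server actually receives, a small
filter over nagios.log can help. This is only a sketch: passive_codes
is a made-up helper name, and it assumes the log format shown in the
excerpt above, where the return code is the fourth semicolon-separated
field of the external command line.

```shell
# passive_codes is a hypothetical helper name, not a Nagios tool.
# Read nagios.log lines on stdin and print "host service return_code"
# for every PROCESS_SERVICE_CHECK_RESULT external command.
passive_codes() {
    awk -F';' '/EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;/ { print $2, $3, $4 }'
}
```

Feeding it the central log (for example `passive_codes < nagios.log`,
with whatever path your install uses) makes it easy to compare the
codes that arrive against the states logged on the distributed server.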

>> - the contents of your check result submission script if it's not
>> exactly like the documented one.
> printfcmd="/usr/bin/printf"
>
> NscaBin="/usr/bin/send_nsca"
> NscaCfg="/etc/nagios/send_nsca.cfg"
> NagiosHost="I_have_the_ip_address_of_my_central_server_here"
>
> # Fire the data off to the NSCA daemon using the send_nsca script
> $printfcmd "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" | $NscaBin -H $NagiosHost -p 5721 -c $NscaCfg

To say whether this is correct or not I'd have to see your OCSP
command definition. If you're using the $SERVICESTATE$ macro, then
this is broken. send_nsca expects a numeric state code, but
$SERVICESTATE$ provides the state as a string (OK, CRITICAL, etc.).
Normally the submission script has to translate that to the proper
numeric code first, but you can also use the $SERVICESTATEID$ macro
instead to get the numeric code directly. My bet is on this being the
problem.
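
As a rough sketch of the translation step, assuming the script
receives host, service, state and plugin output as $1 through $4 like
the one quoted above: state_to_code and format_result are hypothetical
names, and mapping unrecognised input to 3 (UNKNOWN) is an assumption
of this sketch rather than documented behaviour.

```shell
# Hypothetical helpers for a submission script; not part of Nagios or NSCA.

# Map a Nagios state name to a numeric return code.
# Numeric input (e.g. from $SERVICESTATEID$) is passed through unchanged.
state_to_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        UNKNOWN)  echo 3 ;;
        [0-3])    echo "$1" ;;
        *)        echo 3 ;;   # assumption: anything else is reported as UNKNOWN
    esac
}

# Build the tab-separated line: host, service, return code, plugin output.
format_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$(state_to_code "$3")" "$4"
}
```

In the quoted script the printf call would then become something like
`format_result "$1" "$2" "$3" "$4" | $NscaBin -H $NagiosHost -c $NscaCfg`,
which behaves the same whether the OCSP command passes $SERVICESTATE$
or $SERVICESTATEID$.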

>> Running nagios and/or NSCA in debug mode on the central server might
>> provide additional information.
> Let me know if you still want this to be done.

Running NSCA in debug mode to see whether it's receiving the 0 status
code from the distributed machine would further narrow down the source
of the problem.
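
Another way to split the problem, sketched here under assumptions, is
to hand-feed a known CRITICAL result from the distributed machine and
see what code shows up on the central server. make_probe is a made-up
helper name, and the marker text is arbitrary; it is only there so the
entry is easy to find in the central nagios.log.

```shell
# make_probe is a hypothetical helper name, not part of Nagios or NSCA.
# Emit one passive result line with a fixed CRITICAL code (2) and a
# marker in the output text so it is easy to spot in the central log.
make_probe() {
    host="$1"
    service="$2"
    marker="${3:-nsca-probe}"
    printf '%s\t%s\t%s\t%s\n' "$host" "$service" 2 "$marker: hand-fed CRITICAL"
}
```

Piping `make_probe surf Explorer` into send_nsca with the same -H, -p
and -c options as the submission script bypasses nagios and the script
entirely. If the central log then shows return code 2 for the probe,
the NSCA transport is fine and the 0 is being introduced earlier.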

--
Marc


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




