Distributed monitoring quirks

mark mark at woodstream.net
Thu Mar 6 19:15:19 CET 2003


Hi all,

I'm setting up distributed monitoring and can't quite get the behavior I'm
after. I'm guessing I just haven't hit on the right configuration, but
after a day of working on it I thought I'd ask the list. My environment
uses a central server that receives passive checks from a distributed
server. The two servers are connected by a site-to-site VPN. This is
important because it means the central monitor cannot see the remote
hosts being monitored, so host checks from the central monitor won't work
if they rely on ping. Now on with the problem description...

I have the distributed monitoring server up and running. It is working
as expected and sending updates to the central server.

The problem I'm having is on the central server. If a remote service goes
to hard critical, the distributed monitor picks it up fine. The central
monitor receives the event, but the critical service shows up as
"disabled". Further, if the remote host goes to hard critical (i.e. down),
the distributed monitor also sees that fine. On the central server, I
never see anything about the host being down... not even a hint.

Now, as for the configuration. On the central monitor, a host looks like this:

# Generic host definition template
define host{
        name                            generic-host  ; The name of this host template
        notifications_enabled           1  ;Host notifications are enabled
        event_handler_enabled           1  ;Host event handler is enabled
        flap_detection_enabled          1  ;Flap detection is enabled
        process_perf_data               1  ; Process performance data
        retain_status_information       1  ; Retain status information
        retain_nonstatus_information    1  ; Retain non-status info
        register                        0  ; DONT REGISTER 
        }

# 'deathstar' host definition
define host{
        use                     generic-host    ; Name of host template
        host_name               deathstar
        alias                   deathstar.company.com
        address                 10.1.1.1
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

Note there is NO check_command definition. Because the central monitor
cannot see the remote hosts, I had to remove the check_command entry.
Otherwise, each time a service had a problem, the central monitor would
try to ping the remote host, fail, and incorrectly mark the host as down.
I have a feeling the lack of a check_command is why I never see remote
hosts go down... even when the distributed monitor sees them go down.

A service entry on the central monitor looks like this:

define service{
        name                            passive-service  ; Template name
        active_checks_enabled           0       ; Disable Active checks
        passive_checks_enabled          1       ; Enable Passive checks
        parallelize_check               1       ; parallelize checks
        obsess_over_service             1       ; obsess over this svc
        check_freshness                 1       ; check service fresh
        freshness_threshold             900     ; Stale if over 15 min.
        notifications_enabled           1       ; enable notification
        event_handler_enabled           1       ; enable event handler
        flap_detection_enabled          1       ; enable flap detection
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status info
        retain_nonstatus_information    1       ; Retain non-status info
        check_command   no-passive-update       ; if stale run this cmd
        register                        0       ; DONT REGISTER THIS 
        }

# Service definition
define service{
        use                             passive-service         ;template
        host_name                       mx1,mx2,mximc1
        service_description             SMTP
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  unix-admins
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        }

Note passive checks are enabled and active checks are disabled. I'm
guessing a hard critical service shows up as "disabled" on the central
server because the service definition has active checks disabled. The
no-passive-update command simply echoes a CRITICAL message saying that the
passive check is stale (as defined by freshness_threshold).
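
For illustration, a stale-check command of this kind is often defined
roughly as follows. This is only a sketch under assumptions: it assumes
the check_dummy plugin is installed in the directory that $USER1$ points
to, and the message text is made up for the example rather than copied
from the configuration in use here.

# Hypothetical definition; plugin path and message text are assumptions
define command{
        command_name    no-passive-update
        command_line    $USER1$/check_dummy 2 "CRITICAL: no passive result received within freshness_threshold"
        }

The first argument to check_dummy is the state to return (2 is CRITICAL),
so when check_freshness is 1 and no passive result has arrived within
freshness_threshold seconds, running this command puts the service into
a CRITICAL state instead of leaving it at its last reported state.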


So... does anyone have ideas on how I can do distributed monitoring in my
situation, where the central monitor cannot see the remote hosts, so that
hard critical service events do not show up as "disabled" and critical
host events show up at all on the central monitor?

Any and all input is greatly appreciated!
Thanks,
Mark


