Active host check scheduling in a distributed environment

Paul Corcoran paul.corcoran.mlist at gmail.com
Wed Jul 15 11:19:36 CEST 2009
2009/7/14 Marc Powell <marc at ena.com>

>
> On Jul 14, 2009, at 9:46 AM, Paul Corcoran wrote:
>
> > Hi,
> >
> > I run a distributed Nagios environment consisting of 1 parent server
> > and 2 child servers.
> >
> > The child servers perform all the service checking while the parent
> > server should be performing active service checks.
>
> Both the child server and the central server are performing active
> service checks?
>

Only the child servers are performing active service checks. The parent
server will check services only after the freshness threshold of 15 minutes
has passed.


>
>
> > The host definitions are configured to perform host checks every 5
> > minutes. The retry interval is 1 minute and the max attempts is set
> > to 5.
>
> On both, or are you submitting passive host checks, or are you expecting
> the central machine to initiate its own active checks of hosts?


At the moment I'm expecting the parent server to perform its own active
host checks.


>
>
> > We are monitoring 580 hosts and approx 4000 services.
> >
> > I noticed when a host down was detected the parent server did not
> > perform any retries of the host. This led to the status of the host
> > being stuck in a SOFT state and therefore no alerts were sent out as
> > required. I noticed that the child server performed the host checks
> > without any problem and the host was logged as being in a HARD down
> > state after 5 failed attempts.
>
> I'm not sure what configuration you could have that would lead to
> this. Can you post the host{} definition and any relevant log entries?
> Are you only sending a single passive host result and have
> 'passive_host_checks_are_soft' set in nagios.cfg?


define host{
        host_name               test_www01
        alias                   test www01 Server
        address                 x.x.x.x
        check_command           check-host-alive
        check_interval          5
        retry_interval          1
        max_check_attempts      5
        check_period            24x7
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r
        contact_groups          ops
}

If this host goes down the parent server notices this and records a soft
state. There was nothing in the logs indicating any retries. The child
server did the requisite recheck at the appropriate intervals and flagged
the state as HARD after the 5th failure.


>
>
> > Is there a specific variable in nagios.cfg that explicitly tells the
> > server to perform active checks?
>
> There are a few --
>        - in nagios.cfg - execute_host_checks=<0/1>
>        - in your host definition - active_checks_enabled [0/1], an
> appropriate check_period, check_interval and retry_interval set and an
> appropriate check_command set.


execute_host_checks=1 is in the nagios.cfg file.
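
For readers following along, the settings Marc lists could be laid out as
below. This is a sketch only: execute_host_checks=1 and the check timing
values come from this thread, while the explicit active_checks_enabled line
is an assumption, since the posted host definition does not set it. Only the
directives relevant to active checking are shown.

```
# nagios.cfg on the parent server (confirmed in this thread)
execute_host_checks=1

# Host definition on the parent server, trimmed to the directives
# that govern active host checks
define host{
        host_name               test_www01
        active_checks_enabled   1       ; assumption: not set in the posted definition
        check_command           check-host-alive
        check_interval          5
        retry_interval          1
        max_check_attempts      5
        check_period            24x7
}
```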

>
>
> > Is it best practice to have the 2 child servers perform passive host
> > checks?
>
> I have no opinion on this other than to say that if you trust the
> remote Nagios servers to correctly report on services, they can usually
> be trusted to correctly report on hosts.
>
> > Is it possible that processing all the passive service check info is
> > causing the parent server to lag behind in its own process queue?
>
> Not likely, IMHO, assuming you're using somewhat modern hardware. You
> can see for sure under Performance Info though. Look for high
> latencies (minutes)... This is a measure of how long after a check was
> scheduled to run it actually ran.


The average latency at the moment for active host checks is 145645 seconds
(roughly 40 hours). This is far beyond the 5-minute check interval, so there
appears to be a bottleneck somewhere that's causing this.

I think I'll probably have to go with passive host checks at this stage, but
it would be nice to know why this situation is occurring.
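
A minimal sketch of what the passive host check setup might look like,
assuming the child servers forward host results the same way they already
forward service results. The 900-second freshness threshold is an assumption
that mirrors the 15-minute service threshold mentioned earlier, and
submit_host_check_result is a hypothetical command name that would need its
own command definition on each child server.

```
# Parent server: host definition, trimmed to the relevant directives
define host{
        host_name               test_www01
        active_checks_enabled   0       ; no regularly scheduled active checks
        passive_checks_enabled  1       ; accept results from the child servers
        check_freshness         1
        freshness_threshold     900     ; seconds; assumed value
        check_command           check-host-alive   ; run only when a result goes stale
        max_check_attempts      5
}

# Parent server: nagios.cfg
accept_passive_host_checks=1
check_host_freshness=1

# Child servers: nagios.cfg
obsess_over_hosts=1
# hypothetical command name, to be defined separately
ochp_command=submit_host_check_result
```

With passive_host_checks_are_soft left at its default of 0, passive host
results should be treated as HARD states on the parent, so the retry logic
stays on the child servers, where it was already working.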

Thanks,

Paul


>
>
> --
> Marc
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null