Service check delays in distributed monitor setup

Fred f1216 at yahoo.com
Mon Sep 12 15:54:46 CEST 2005
I believe I have found the source of my issue around service check delays
in the distributed monitoring setup.

Many thanks to Andreas Ericsson for reminding me about socket and child
resource requirements ...

    If you read the code you'll notice that the active checks also write 
    their service results to the FIFO. This is a showstopper on the road to 
    "scale like hell", so a few various other methods are being tested. 
    Multiplexing several children from a single parent seems the way to go. 
    509 checks can run smoothly at once on a modern system (round about
    1017 if you don't let the child have an stderr). The limit is set by
    sysconf(_SC_OPEN_MAX) / 2, or sysconf(_SC_CHILD_MAX), whichever is 
    lowest.
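
As a rough illustration of the rule Andreas describes, here is a small Python sketch that computes min(OPEN_MAX / 2, CHILD_MAX). The function name max_concurrent_checks and the fds_per_child parameter are made up for this example and are not part of Nagios. It assumes every descriptor is available for checks, so for a 1024-descriptor limit it yields 512 and 1024 rather than the slightly lower 509 and 1017 quoted above.

```python
import os


def max_concurrent_checks(open_max, child_max, fds_per_child=2):
    """Upper bound on simultaneous checks under the rule quoted above.

    fds_per_child is a hypothetical knob: 2 matches the OPEN_MAX / 2 rule,
    1 corresponds to not giving the child an stderr.
    """
    by_fds = open_max // fds_per_child
    # sysconf() reports -1 when a limit is indeterminate; ignore it then.
    return by_fds if child_max < 0 else min(by_fds, child_max)


if __name__ == "__main__":
    # Query the limits that apply to the current process.
    open_max = os.sysconf("SC_OPEN_MAX")
    child_max = os.sysconf("SC_CHILD_MAX")
    print("OPEN_MAX:", open_max, "CHILD_MAX:", child_max)
    print("checks at once:", max_concurrent_checks(open_max, child_max))
```

With 1024 descriptors and a large CHILD_MAX the bound is 512; at 8192 descriptors it becomes 4096 or CHILD_MAX, whichever is lower.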

My file open ulimit for nagios was at 1024, the default.  By doing a 
ulimit -n 8192 in my nagios service startup script, all things came back
to normal ... services started scheduling, processes stopped hanging, it
was a beautiful thing ;-) 
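
For anyone who wants to check the same limit from inside a process rather than from the startup script, a minimal Python sketch follows. It only raises the soft limit, and only as far as the hard limit allows; raising the hard limit itself needs privileges. raise_nofile_limit is a hypothetical helper for illustration, not something Nagios provides.

```python
import resource


def raise_nofile_limit(target):
    """Raise the soft open-file limit towards target; never lower it.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit cannot exceed the hard limit without privileges.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft limit now:", raise_nofile_limit(8192))
```

The new limit applies to the calling process and to children started afterwards, which is the same effect as putting ulimit -n in the startup script before the daemon is launched.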

The interesting observation from this is that nagios logged no failure
messages about being unable to fork child processes, nor any resource-related
errors, in any of its logs.  I suspect some limits were being crossed
but nothing was reporting it.
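
To show what kind of errors would have been worth logging here, this Python sketch maps the errno values that POSIX documents for fork(), pipe() and open() under resource exhaustion to readable messages. It is not how Nagios handles these errors; describe_resource_error and pipe_or_log are hypothetical names and the message wording is invented for the example.

```python
import errno
import os

# errno values raised when a descriptor or process limit is hit.
RESOURCE_ERRNOS = {
    errno.EMFILE: "per-process open file limit reached (see ulimit -n)",
    errno.ENFILE: "system-wide open file table is full",
    errno.EAGAIN: "process limit reached or resources temporarily unavailable",
    errno.ENOMEM: "not enough memory to create the child process",
}


def describe_resource_error(exc):
    """Return a log-worthy message for a resource-limit OSError, else None."""
    return RESOURCE_ERRNOS.get(exc.errno)


def pipe_or_log(log=print):
    """Create a pipe for a check's output; log resource errors, then re-raise."""
    try:
        return os.pipe()
    except OSError as exc:
        message = describe_resource_error(exc)
        if message:
            log("check not started: " + message)
        raise
```

A daemon that wraps its pipe and fork calls this way would leave a trace in its log when the limit is crossed instead of silently stalling.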

Thanks to all who responded.
-FredC

--- Fred <f1216 at yahoo.com> wrote:

> Unfortunately, setting the increment to a small number only worked to
> set the pending state to something that looked reasonable, however, the
> services still never get scheduled.
> 
> My configuration *was* working at one point, I tweaked something and
> now no matter what I do, I can't get it to start monitoring again.  My
> passive checks received from other monitor nodes all seem to get registered,
> it's just the active checks that run on the master (head) node never see
> the light of day any more.  If I regenerate the configuration to not use
> distributed monitoring, it works just fine, however, that puts way too much
> pressure on a single node.  I removed the status.sav, but as I type
> this I'm thinking I should nuke all the cache files that nagios builds, maybe
> there is something that got munged in there ...
> 
> We've used both Nagios 1.2 and now 2.0b3 (testing 2.0b4) and I have yet
> to need to crack open the source and make any mods ... looks like that time
> is coming ;-) 
> 
> -FredC
> 
> --- misc at viceconsulting.co.nz wrote:
> 
> > Hi Fred,
> > 
> > I have encountered the exact same problem with my central Nagios server.
> > It has about 1000 passive services, but only about 10 active services (the
> > active services being used for the central Nagios server to monitor
> > itself).  The 1000 passive services receive their results from the 5
> > distributed servers.
> > 
> > When I restart the Central Nagios server, the active checks get scheduled
> > for 3 hours+ into the future, but they never actually seem to run.  For
> > days the active checks have not actually been checking themselves.
> > 
> > I tried changing the service_inter_check_delay_method to d for dumb, which
> > appeared to schedule it when I expected (ie within about 5 mins after the
> > restart) but it still didn't run them.
> > 
> > Your idea of setting service_inter_check_delay_method=0.05 sounds good.  I
> > haven't had any luck getting the 10 or so active services checking on my
> > central Nagios server.
> > 
> > Is anyone able to confirm that this is a known problem in Nagios, is there
> > a better workaround, is this to be fixed in 2.0 final?
> > 
> > Fred, keep the list posted if you make further breakthroughs.
> > 
> > Cheers
> > Alex
> > 
> > On 7 Sep 2005 at 11:03, Fred wrote:
> > 
> > > I think I have found the source of my issue with distributed monitoring
> > > and service checks.
> > >
> > > It turns out that if you enable distributed monitoring, even passive
> > > service check definitions seem to get scheduled to run when nagios
> > > starts up.  If you have say 10350 services (give or take one) and use
> > > smart scheduling of services, you could easily see 3+ hours between the
> > > time that the first service is scheduled and the last one.  Changing the
> > > smart scheduling to "n" for no delay causes the services to not be
> > > scheduled in the future, but by the time nagios processes the entire
> > > configuration file, the start time is in the past and I think nagios
> > > forgets about the service so it is never scheduled again.
> > >
> > > I'm currently trying a service_inter_check_delay_method=0.05 which puts
> > > me at about 3 minutes for 10,000+ services, which seems to be enough
> > > time for nagios to start up and still have its first pending service
> > > scheduled in the near future rather than the near past ...
> > >
> > > Does this make sense to anyone who has been messing with these
> > > configuration settings?
> > >
> > > Is there a better way to do this?  I.e., I would like for nagios to *not*
> > > consider the passive checks in any scheduling.  I actually only have a
> > > small number of active checks which when run will populate the rest of
> > > the passive checks for the entire cluster, the problem is that it seems
> > > the node that I run these checks on is alphabetically *after* all of the
> > > other nodes so it seems to be scheduled last and has services starting
> > > the furthest out.
> > >
> > > Thanks.
> > > -FredC

-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null