how is "Service check Latency" defined in nagios?

Max perldork at webwizarddesign.com
Tue Feb 10 00:51:26 CET 2009


Rahul,

On Mon, Feb 9, 2009 at 6:32 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> Thanks for the blog. Just found a very useful snippet there: "ps -e -a -x -f
> -o %u | sort | uniq -c | sort -rn" there. If I use this I find that the
> "nagios" owned processes seem to fluctuate a lot. Suddenly it goes as high
> as 54 and then for a while it owns only 3 processes. Then it shoots up
> again. Very interesting. Maybe that is the phenomenon you were referring to?
> I should probably wrap it in a bash wrapper and get it to graph the nagios
> processes in a 1 sec resolution to get a finer-time-grained idea of what is
> going on!

No, that is not the phenomenon I was talking about :).

Every time Nagios executes a check it forks to execute the check code
(it will fork twice unless you configure it not to, either with the
large installation tweaks (use_large_installation_tweaks) or by
explicitly setting the double fork option, child_processes_fork_twice,
to 0).

The scheduling 'skew' I was referring to builds up as Nagios sleeps
between executing checks, sleeps globally per the sleep_time setting,
or spends time scheduling checks.  The various delay settings are all
designed to make Nagios not completely hammer client hosts or the
network or itself :p.  With NEB modules, when a module takes an action
Nagios pauses until the action is complete, so if anything happens
that takes more than a few milliseconds, over time the scheduling of
checks gets slowly pushed into the future relative to the intervals
they are supposed to run at.

An example of skew would be a check that is initially scheduled at 5
10 15 20 eventually ending up being scheduled at 8 13 18 23, and once
that skew gets past 5 minutes, performance graphs end up with gaps.
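
To make the drift concrete, here is a minimal sketch of how a small
per-cycle delay accumulates.  The function names and the slip values
are made up for illustration (a one minute slip is wildly exaggerated),
and it assumes every cycle slips by the same amount, which real Nagios
scheduling will not do exactly.

```python
import math

def run_times(first, interval, slip, runs):
    """Start times when every cycle runs `slip` longer than `interval`."""
    return [first + n * (interval + slip) for n in range(runs)]

def cycles_until_skew(limit, slip):
    """Number of cycles before the accumulated slip reaches `limit`."""
    return math.ceil(limit / slip)

# Minutes: a check that stays on schedule.
print(run_times(5, 5, 0, 4))      # [5, 10, 15, 20]
# Minutes: the same check slipping one (exaggerated) minute per cycle.
print(run_times(5, 5, 1, 4))      # [5, 11, 17, 23]
# Seconds: 2 s of slip per cycle eats a whole 300 s interval in 150 cycles.
print(cycles_until_skew(300, 2))  # 150
```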

Before tweaking our settings and doing the big no-no of adding a
fork() call to the PNP NEB module, we were seeing checks that were
scheduled to run every 5 minutes have their scheduling times skew by
more than 5 minutes within a 12 hour period.  After the tweaks it
takes nearly 2 days for this to occur, and we do a daily restart right
now, so this skew never pushes checks outside of their original 5
minute window.
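
As a rough back-of-the-envelope check on those figures, the sketch
below turns them into an average slip per check cycle.  It assumes the
skew accumulates evenly, and it treats "more than 5 minutes in 12
hours" and "nearly 2 days" as exactly 300 seconds in 12 and 48 hours,
so the results are ballpark numbers only.  The function name is
hypothetical.

```python
def avg_slip_per_cycle(interval_s, period_s, skew_s):
    """Average seconds of slip per cycle, given total skew over a period."""
    cycles = period_s / interval_s
    return skew_s / cycles

before = avg_slip_per_cycle(300, 12 * 3600, 300)  # 144 cycles in 12 hours
after = avg_slip_per_cycle(300, 48 * 3600, 300)   # 576 cycles in 2 days

print(round(before, 2))  # 2.08 seconds per cycle
print(round(after, 2))   # 0.52 seconds per cycle
```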

I have no complaints about how Nagios operates.  It was designed first
and foremost to be a fault manager, not a time-series performance data
graphing engine, and thankfully it is flexible and open enough that we
have been able to wrangle it to do what we want without having to
change any core Nagios code, which is very cool.

Graph nagios performance with PNP using the output of nagiostats from
scripts run from cron or by Nagios itself.
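
One possible shape for such a script is sketched below.  It is not the
script we run; it assumes nagiostats is on the PATH, that MRTG mode
(--mrtg with --data=...) prints one value per line in the order
requested, and that AVGACTSVCLAT and AVGACTSVCEXT are available in
your version (check nagiostats --help).  The helper names and the
"NAGIOSTATS OK" text are made up.

```python
import subprocess

# Assumed variable names: average active service check latency and
# execution time.  Verify them against `nagiostats --help`.
VARS = ["AVGACTSVCLAT", "AVGACTSVCEXT"]

def parse_mrtg_output(text, names):
    """Pair each requested variable with the value printed on its line."""
    values = [line.strip() for line in text.splitlines() if line.strip()]
    return dict(zip(names, values[:len(names)]))

def perfdata_line(stats):
    """Build a plugin-style line: status text, a pipe, then label=value pairs."""
    perf = " ".join(f"{name.lower()}={value}" for name, value in stats.items())
    return f"NAGIOSTATS OK | {perf}"

def collect(binary="nagiostats"):
    """Run nagiostats in MRTG mode and return a line with performance data."""
    result = subprocess.run(
        [binary, "--mrtg", "--data=" + ",".join(VARS)],
        capture_output=True, text=True, check=True,
    )
    return perfdata_line(parse_mrtg_output(result.stdout, VARS))
```

Printing the result of collect() from a check command or cron job
gives PNP a performance data string to graph.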

Regards,
Max
