Latency ok but last check of service hours old?

Brad Johnson bjohnson at got.wedgie.org
Thu Sep 18 08:58:31 CEST 2003


Okay, I give up. I've tried to look everywhere save digging through the
source code.

Can anyone explain how Nagios calculates its latency? My box is
performing 301 service checks and they are all active checks. Earlier I
had max_concurrent_checks set to 0, service_reaper_frequency to 10, and
aggregate_status_updates to 1. My Nagios box was launching a huge load of
processes, and some checks were just plain getting forgotten. If I looked
at the scheduling queue, I'd see services that had last been checked an
hour ago with the "next check" scheduled for 55 minutes in the past. Yet
my latency numbers according to Performance Info were only at a few
seconds. I don't see how that could be. I double-checked to make sure no
old nagios "boss" processes were running (I know this can confuse it
sometimes). No luck. I tried turning on the "check for orphans" thing, and
that just seemed to make it worse because it kept re-scheduling old
"forgotten" checks before they got run.

It makes me think that maybe the checks are actually executing on time
(so that Nagios's concept of latency is still okay), but the service
reaper isn't updating the status properly. Would that explain what I'm
seeing? On the mailing list archives I saw a couple of people who were
asking similar questions and experiencing similar problems to mine, but no
one seemed to have a good answer.
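
To make that hypothesis concrete, here is a toy sketch (not Nagios code). It
assumes latency is measured as actual start time minus scheduled time, and
that the displayed "last check" only advances when a result is reaped. The
Check class and both helper functions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    scheduled_at: int  # when the check was due, in seconds
    started_at: int    # when the plugin actually started
    reaped: bool       # whether the result made it into the status data


def latency(check: Check) -> int:
    """Assumed definition: how late the check started."""
    return check.started_at - check.scheduled_at


def displayed_age(history: list[Check], now: int) -> int:
    """Age of the newest check whose result was actually reaped."""
    return now - max(c.started_at for c in history if c.reaped)


# One service checked every 300 s for an hour. Every check starts 2 s late,
# but only the very first result is ever reaped (hypothetical scenario).
history = [Check(t, t + 2, reaped=(t == 0)) for t in range(0, 3600, 300)]

print("max latency:", max(latency(c) for c in history))
print("displayed age of last check:", displayed_age(history, now=3600))
```

In this model latency stays at 2 seconds while the displayed last check is
almost an hour old, which matches the symptom. It does not show that Nagios
actually behaves this way.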

I tried a number of different things, and I seem to have finally got
something that is working for me. I set up all my checks so they would
time out in no more than 15 seconds. My normal check intervals are 5
minutes. I turned off aggregate_status_updates and changed my
max_concurrent_checks to something reasonable like 30 (whatever nagios -s
suggested). I also disabled some checks that I knew would fail because the
hosts are down right now. At the moment, my "oldest" service check is only
five minutes old. That's good. It makes me happy.
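
For reference, the two main-config changes described above would look roughly
like this in nagios.cfg. The value 30 is simply the figure mentioned above;
the 15-second check timeouts and the 5-minute check intervals are set
elsewhere and are not shown.

```
# nagios.cfg fragment (sketch of the settings described above)

# Write status data after each check result instead of batching updates
aggregate_status_updates=0

# Cap the number of service checks running in parallel (was 0 = unlimited)
max_concurrent_checks=30
```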

Can anyone shed some light on what might be going on? I'm gonna go delve
into the source code hoping to find some clues. If I find anything, I'll
post back. But I'm kinda hoping someone has already done the hard work.
:-)

Thanks,
Brad Johnson

P.S. I tinkered with almost every setting at one point or another today
with no real change. Only after I disabled aggregate_status_updates did my
"oldest" service checks start looking more reasonable. I noticed that in
my old netsaint config, I had aggregate_status_updates disabled as well.
Could this be the culprit?




