More info/leads: Huge delay in scheduling freshness service check after 3rd try

Ethan Galstad nagios at nagios.org
Thu Mar 27 03:50:57 CET 2003


If you haven't changed the interval_length directive from its default 
value of 60, the problem lies in your config.  In your service 
definition you are telling Nagios to actively check the service every 
15 minutes.  However, the freshness threshold for that service is set 
at 3 minutes (180 seconds).  The freshness checks get performed about 
once every minute, so that's why you're seeing staleness warnings 
approximately every 4 minutes. 
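
As a rough cross-check of that arithmetic, here is a minimal Python sketch 
(not part of Nagios; the variable and function names are only illustrative). 
It uses the settings from the quoted service definition and the timestamps 
copied from the quoted log excerpt, and assumes interval_length is still 60 
and that freshness checks run roughly every 60 seconds, as described above.

```python
# Settings taken from the quoted service definition; interval_length = 60
# is the default and is an assumption here.
INTERVAL_LENGTH = 60          # seconds per "time unit"
NORMAL_CHECK_INTERVAL = 15    # time units
FRESHNESS_THRESHOLD = 180     # seconds
FRESHNESS_CHECK_PERIOD = 60   # seconds, "about once every minute"

# Unix timestamps copied from the quoted log excerpt.
warnings = [1048628465, 1048628705, 1048628945, 1048629185,
            1048629485, 1048629965, 1048630925]
alerts = [1048628471, 1048628711, 1048628966, 1048629287,
          1048629770, 1048630715]


def gaps(times):
    """Seconds between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def delays(warn_times, alert_times):
    """Seconds from each staleness warning to the alert that follows it."""
    return [alert - warn for warn, alert in zip(warn_times, alert_times)]


active_interval = NORMAL_CHECK_INTERVAL * INTERVAL_LENGTH
expected_warning_gap = FRESHNESS_THRESHOLD + FRESHNESS_CHECK_PERIOD
warning_gaps = gaps(warnings)
alert_delays = delays(warnings, alerts)

print("active check interval (s):", active_interval)
print("expected gap between warnings (s):", expected_warning_gap)
print("observed gaps between warnings (s):", warning_gaps)
print("warning-to-alert delays (s):", alert_delays)
```

The first three observed gaps come out at 240 seconds, which is consistent 
with the "approximately every 4 minutes" estimate.  The warning-to-alert 
delays grow from 6 to 750 seconds across the excerpt, in line with the 
latency figures reported in the original message.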


On 26 Mar 2003 at 10:18, Erik Larkin wrote:

> 
> FYI, after a little more research, I think I've narrowed it down to an issue
> with the scheduling queue.  I tossed a debug option in my stale-service
> script that logs the time it's called.  Then I cross-referenced those times
> with the times that nagios logged a failed freshness check, and the times
> that nagios received the response from the stale-service script.  The time
> difference between when the script is actually called and when nagios logs
> the script response is maybe a few seconds, leading me to believe that the
> service reaper is ok.  However, the delay between when nagios says it failed
> a freshness check and is forcing a service check, and when the stale-service
> script is actually called, was over 32000 seconds at last failure.  So, I'm
> now focusing on problems with the scheduling queue.  Any ideas, anyone?
> 
> -----Original Message-----
> From: Erik Larkin 
> Sent: Tuesday, March 25, 2003 2:44 PM
> To: 'nagios-users at lists.sourceforge.net'
> Subject: Huge delay in scheduling freshness service check after 3rd try
> 
> 
> 
> All right, I'm finally admitting that I can't figure this one out myself.
> Trust me, it's a difficult admission, and has involved much tinkering,
> hair-pulling, and searching of mailing lists (although searching doesn't
> seem to be working right now for the sourceforge lists?).
> 
> Anyways, I've got a nagios architecture with multiple distributed servers
> sending check results to a central nagios server via NSCA.  The central
> server doesn't perform any active checks (no network access to the
> distributed network), but is configured to perform a freshness check for a
> service called 'Heartbeat' for each distributed instance.  The heartbeat is
> just a ping of the loopback performed every minute, although I've since
> discovered I could have used check_dummy.  Seems to be a pretty common
> setup, and for the most part it works very well.
> 
> Except for the freshness checks.  They work fine up until the 3rd failed
> freshness check or so,  at which point latency skyrockets.  From 99 to 280
> to 749, on up to thousands and thousands of seconds of latency.  The log
> reflects a failed freshness check, and a message about forcing the service
> (which is the typical echo and exit 2).  But the service alert response is
> delayed more and more.  I've tried everything I can think of, and learned a
> great deal in my searching and tweaking, but I can't change this behavior.  
> 
> Here's what I've tried:
> 
> - change the service_reaper_frequency to 3.  saw a reference for this in the
> list for something else, thought it might help.  I still suspect some
> problem with the service reaper.
> - added a 1 second sleep to the script (thought maybe it was returning its
> status too quickly)
> - futzed with the normal_check_interval for the heartbeat service on the
> central server.  gave it ranges between 1 minute and 15 minutes.
> - enabled check_for_orphaned_services
> - tossed a debug option in my stale_service script that sent a line of
> output to a log, to make sure that the script itself was being run (it was)
> - setting is_volatile  (just to check)
> - other things I can't think of right now.
> 
> And here's the service entry:
> 
> define service{
>         use                             qab24x7-service
>         service_description             Heartbeat
>         hostgroup_name                  qabdbfohub
>         normal_check_interval           15
>         is_volatile                     1
>         max_check_attempts              1
>         check_freshness                 1
>         notification_interval           15
>         freshness_threshold             180
>         check_command                   stale-service
>         }
> 
> 
> And here's some relevant log snippets:
> 
> 
> [1048628465] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048628471] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048628705] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 58 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048628711] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048628945] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048628966] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048629185] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 42 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048629287] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048629485] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 21 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048629770] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048629965] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 20 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> [1048630715] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
> Heartbeat check is stale!
> [1048630925] Warning: The results of service 'Heartbeat' on host
> 'sj-qab-db01' are stale by 31 seconds (threshold=180 seconds).  I'm forcing
> an immediate check of the service.
> 
> 
> 
> Sorry for the long email/spam, but please oh please:  does anyone have
> any info regarding this problem?
> 
> 
> Many thanks,
> 
> Erik Larkin
> elarkin at nuasis.com
> 
> p.s.  Just to go on record, I do think Nagios rocks.  Hard.  But this itty
> bitty problem is driving me nuts!  ;)
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by:
> The Definitive IT and Networking Event. Be There!
> NetWorld+Interop Las Vegas 2003 -- Register today!
> http://ads.sourceforge.net/cgi-bin/redirect.pl?keyn0001en
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 



Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org


