passive service checks with 1 second interval

Risto Vaarandi risto.vaarandi at seb.ee
Fri Aug 10 13:43:25 CEST 2007


hi all,
yesterday I attempted to implement passive checks for a volatile service 
with 1 second interval (i.e., once a second, the status of a service is 
written to Nagios command file), but I am experiencing some problems 
with how the service status is displayed (and notifications). Since I 
haven't implemented such checks before, I'd like to consult with more 
experienced users if Nagios alone is suitable for monitoring externally 
submitted checks with such a short interval.
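
For reference, a minimal sketch of how one such result could be formatted and written to the command file. The path in COMMAND_FILE and both helper names are assumptions for illustration only; the real path is whatever command_file is set to in nagios.cfg. The line layout matches the entries shown in the log excerpts.

```python
import time

# Assumed location; the actual path is the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line (hypothetical helper).

    return_code follows the plugin convention: 0=OK, 1=WARNING,
    2=CRITICAL, 3=UNKNOWN.
    """
    ts = int(time.time()) if now is None else int(now)
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit_passive_result(line, command_file=COMMAND_FILE):
    """Write a single command line to the Nagios command file (a named pipe)."""
    # Open, write one complete line, and close again so the write is flushed.
    with open(command_file, "w") as fifo:
        fifo.write(line)
```

A submitter running once a second would call format_passive_result("node03", "NodeState", 0, "node03 up at ...") and pass the result to submit_passive_result() on each tick.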

If the service is up, the Nagios log shows that it reads the status 
without any delay from its command file:

[1186719368] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719368
[1186719369] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719369
[1186719370] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719370
[1186719371] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719371
[1186719372] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719372

However, then the service goes to a critical state:

[1186719373] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719373

and starting from this moment, external checks are read from command 
file with 9-10 second intervals, with a "service alert" and notification 
at the end of each activity burst:

[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719374
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719375
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719376
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719377
[1186719384] SERVICE ALERT:
node03;NodeState;CRITICAL;HARD;1;node03 DOWN at 1186719373
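
The delay can be read directly from these entries by subtracting the submission time embedded in the plugin output from the timestamp Nagios logged. A rough sketch follows, assuming each EXTERNAL COMMAND entry sits on a single line in the actual log file (the excerpts above are wrapped) and that the plugin output ends in "at <epoch>" as it does here. The function name is hypothetical.

```python
import re

# Logged passive result whose plugin output ends in "at <epoch seconds>".
ENTRY = re.compile(
    r"\[(\d+)\] EXTERNAL COMMAND:\s*PROCESS_SERVICE_CHECK_RESULT;.*\bat (\d+)\s*$"
)


def processing_lags(log_lines):
    """Return (logged - submitted) in seconds for each matching log line."""
    lags = []
    for line in log_lines:
        match = ENTRY.search(line)
        if match:
            logged, submitted = int(match.group(1)), int(match.group(2))
            lags.append(logged - submitted)
    return lags


sample = [
    "[1186719372] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719372",
    "[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719374",
    "[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719377",
]
print(processing_lags(sample))  # [0, 10, 7]
```

For the entries above, the OK result is logged in the same second it was submitted, while the DOWN results are logged 7 to 10 seconds late.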

Then the service goes up, and after a while I see the following log 
entries:

[1186719447] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;node03 up at 1186719447
[1186719447] Warning: The results of service 'NodeState' on host 
'node03' are stale by 11 seconds (threshold=60 seconds).  I'm forcing an 
immediate check of the service.

I have freshness checks enabled, and the service status is reported as 
stale, although it was reported as normal shortly before.

As a result, I am seeing service notifications with wrong timestamps - 
the notifications appear after 18 second intervals, although the DOWN 
service checks are submitted after 1 second intervals. In addition, the 
service status is reported as stale after it has gone up.

Is there a way to speed up the processing of CRITICAL service checks? 
I'd like to get a notification within the same second.

br,
risto
