Service check goes HARD too quickly if multiple services are in a problem state

Andrew Thompson andrew at fulgent.co.uk
Tue Jan 15 18:51:35 CET 2013


Hi,

I have had this problem previously and posted here, but got nowhere with it.

I'll have another bash.

Basically, my Nagios machine is checking too frequently and firing out alerts too quickly.

It's ignoring the retry_interval value and the max_check_attempts value, and ignoring the notification_interval value in the escalations.

I have a check interval of 5 minutes in OK state
A retry interval of 3 minutes when in a problem state
A notification interval of 3 minutes

I believe the problem is shown below, and that multiple service checks being in a problem state at the same time is causing this.


I've just seen this on one of my hosts:

It appears it's accumulating the service checks (even though they are different checks) into a final HARD state.

Prior to 17:18, all was fine on this host!


Then at 17:18 a SQL check went into WARNING state, at SOFT 1.

It was checked again at 17:21, which is the 3-minute retry interval I have told it to use when in a problem state, and it was still WARNING, so it went on to SOFT 2.

Then a different service check on that host went CRITICAL, but for the first time.

At 17:22 the memory usage check failed and Nagios put it at HARD 3, even though this check for memory should have been at SOFT 1.

An alert then got sent straight out for the memory check, even though it was actually only check 1/3 on that particular service.
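
As a rough illustration of the behaviour expected here, the following Python sketch keeps a separate attempt counter per service, so a problem only becomes HARD once that service's own counter reaches the limit. It is a simplified model for discussion, not Nagios's actual code: the class and function names are made up, and the limit of 3 is assumed from the "1/3" count above.

```python
from dataclasses import dataclass

MAX_CHECK_ATTEMPTS = 3  # assumed from the "1/3" attempt count in the post


@dataclass
class ServiceState:
    """Per-service retry bookkeeping (hypothetical, simplified model)."""
    attempt: int = 0
    state_type: str = "HARD"  # a steady OK state is treated as HARD here


def record_result(svc: ServiceState, ok: bool) -> ServiceState:
    """Update one service's counter from its own check result only."""
    if ok:
        svc.attempt = 0
        svc.state_type = "HARD"
    else:
        # Count up to the limit; other services never touch this counter.
        svc.attempt = min(svc.attempt + 1, MAX_CHECK_ATTEMPTS)
        svc.state_type = "HARD" if svc.attempt >= MAX_CHECK_ATTEMPTS else "SOFT"
    return svc


services = {"SQL LOCK TIMEOUTS": ServiceState(), "MEMORY USAGE": ServiceState()}

# Same order of events as described: two SQL warnings, then the first memory critical.
record_result(services["SQL LOCK TIMEOUTS"], ok=False)
record_result(services["SQL LOCK TIMEOUTS"], ok=False)
record_result(services["MEMORY USAGE"], ok=False)

for name, svc in services.items():
    print(f"{name}: {svc.state_type} {svc.attempt}/{MAX_CHECK_ATTEMPTS}")
```

Under this model the memory check would still be at SOFT 1/3 after its first failure, which is what I expected to see.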

Here is the copy and paste from the history of the host:

[01-15-2013 17:18:24]
SERVICE ALERT: SERVER;SQL LOCK TIMEOUTS;WARNING;SOFT;1;WARNING - 2.3067 lock timeouts / sec for _Total, 2.0667 lock timeouts / sec for Key, 0.0000 lock timeouts / sec for RID, 0.2400 lock timeouts / sec for Page, 0.0000 lock timeouts / sec for Object, 0.0000 lock timeouts / sec for Metadata, 0.0000 lock timeouts / sec for HoBT, 0.0000 lock timeouts / sec for File, 0.0000 lock timeouts / sec for Extent, 0.0000 lock timeouts / sec for Database, 0.0000 lock timeouts / sec for Application, 0.0000 lock timeouts / sec for AllocUnit
[01-15-2013 17:21:24]
SERVICE ALERT: SERVER;SQL LOCK TIMEOUTS;WARNING;SOFT;2;WARNING - 1.3056 lock timeouts / sec for _Total, 1.1833 lock timeouts / sec for Key, 0.0000 lock timeouts / sec for RID, 0.1222 lock timeouts / sec for Page, 0.0000 lock timeouts / sec for Object, 0.0000 lock timeouts / sec for Metadata, 0.0000 lock timeouts / sec for HoBT, 0.0000 lock timeouts / sec for File, 0.0000 lock timeouts / sec for Extent, 0.0000 lock timeouts / sec for Database, 0.0000 lock timeouts / sec for Application, 0.0000 lock timeouts / sec for AllocUnit

[01-15-2013 17:22:04]
SERVICE ALERT: SERVER;MEMORY USAGE;CRITICAL;HARD;3;CRITICAL: physical memory: Total: 10G - Used: 9.81G (98%) - Free: 192M (2%) > critical
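
The entries above are semicolon-separated (host, service, state, state type, attempt number, plugin output), so they can be grouped per service to see which counter each line belongs to. A small Python sketch follows; the helper name is made up, each timestamp is joined onto the same line as its alert, and the long plugin output is shortened.

```python
from collections import defaultdict

# The three history entries from the post, plugin output shortened.
HISTORY = """\
[01-15-2013 17:18:24] SERVICE ALERT: SERVER;SQL LOCK TIMEOUTS;WARNING;SOFT;1;WARNING - 2.3067 lock timeouts / sec for _Total
[01-15-2013 17:21:24] SERVICE ALERT: SERVER;SQL LOCK TIMEOUTS;WARNING;SOFT;2;WARNING - 1.3056 lock timeouts / sec for _Total
[01-15-2013 17:22:04] SERVICE ALERT: SERVER;MEMORY USAGE;CRITICAL;HARD;3;CRITICAL: physical memory: Total: 10G - Used: 9.81G (98%)
"""


def parse_service_alert(line):
    """Split one SERVICE ALERT line into its fields (hypothetical helper)."""
    stamp, _, rest = line.partition("] SERVICE ALERT: ")
    # Split on the first five semicolons only; plugin output may contain more.
    host, service, state, state_type, attempt, output = rest.split(";", 5)
    return {
        "time": stamp.lstrip("["),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


# Collect (time, state type, attempt) tuples under each service name.
by_service = defaultdict(list)
for line in HISTORY.splitlines():
    alert = parse_service_alert(line)
    by_service[alert["service"]].append(
        (alert["time"], alert["state_type"], alert["attempt"])
    )

for service, entries in by_service.items():
    print(service, entries)
```

Grouped this way, MEMORY USAGE has a single entry, already at HARD 3, with no SOFT 1 or SOFT 2 entry before it.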



Does anybody please have any idea why my server is checking too frequently and alerting too frequently, and why it's totting up different service checks?

This machine has never worked right since it was loaded a couple of months ago.
I'm using the same config files on it as I did on the previous box I had. The only difference is that the previous box was running 3.3.1, and I had none of these problems on that install.


This is a Nagios 3.4.1 install on an Ubuntu 12.04 desktop 32-bit OS.



Thanks in advance
