Transient errors

Frank Bulk frnkblk at iname.com
Sun Mar 11 22:13:35 CET 2012
Previous message: Transient errors
Next message: Nagios 3.2.3 -> 3.3.1 upgrade path
Yes, we have to make the same kinds of tweaks in our environment.  Sometimes
I've had to develop a new plugin or monitor different elements that will
alert me to the situation more quickly.

Frank

-----Original Message-----
From: Andreas Ericsson [mailto:ae at op5.se] 
Sent: Friday, March 02, 2012 4:15 AM
To: Nagios Users List
Subject: Re: [Nagios-users] Transient errors

On 03/01/2012 10:38 PM, David Dyer-Bennet wrote:
> 
> I see a lot of transient errors on services and hosts I'm monitoring.
> Hence finding ways to keep notifications from going out for situations that
> will resolve themselves is kind of an issue.
> 
> I've played with how many failures in a row are needed to cause a
> notification, and have that set differently for things I'm monitoring
> across long links (Beijing, say) compared to things I'm monitoring locally
> or in New York.  Of course, one problem with that is that it makes it take
> longer before a real problem causes a notification.  Right now it takes
> over 15 minutes for the total failure of our link to Beijing to cause a
> notification.
> 
> For things that are numeric values, I can play with the critical and
> warning ranges to potentially reduce false positives.  That, at least,
> doesn't slow down recognition of total failures.   Some things just don't
> seem to fit the Nagios model -- for example it's quite normal for the SQL
> server to pull 100% of the cpu for periods now and then, but if it goes on
> too long, *that's* unusual.  Hmm; I suppose I could override the number of
> failures needed to cause a notification in the service definition for
> those, couldn't I? There may be some things I should just stop monitoring
> (there aren't clear-cut "okay" and "bad" behaviors that I can quantify).
> 
> I guess I'm wondering if there are useful basic approaches to handling
> this problem that I'm missing, or if I just need to work through the
> details more carefully.   I'm startled at how often I get isolated
> failures for no apparent reason.  Is that normal for most people
> monitoring services?  I think I'm finding my connections time out now and
> then due simply to load, without the load actually being at all high.
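The trade-off described above (more failures in a row means slower notification of a real outage) can be estimated with a little arithmetic. The following is a minimal sketch, assuming the usual Nagios 3 behaviour where a failing service is rechecked every retry_interval until max_check_attempts is reached; the function name and the example values are hypothetical, and check latency and plugin timeouts are ignored.

```python
def minutes_until_hard_state(check_interval, retry_interval,
                             max_check_attempts, interval_length=60):
    """Estimate (best, worst) minutes from an outage to a HARD problem state.

    check_interval and retry_interval are in Nagios time units;
    interval_length is the number of seconds per time unit (60 by default).
    """
    # After the first failed check, the remaining attempts run at retry_interval.
    retries = (max_check_attempts - 1) * retry_interval
    # Best case: the outage starts just before a scheduled check.
    best = retries
    # Worst case: the outage starts just after a check, so a full
    # check_interval passes before the first failure is even seen.
    worst = check_interval + retries
    to_minutes = interval_length / 60
    return best * to_minutes, worst * to_minutes


# Hypothetical long-link service: checked every 5 units, retried every 3,
# notified after 4 consecutive failures.
best, worst = minutes_until_hard_state(5, 3, 4)
print(best, worst)
```

Lowering retry_interval rather than max_check_attempts is one way to keep the same number of confirmations while shortening the wait, at the cost of more check traffic over the link.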

Apart from the great writeup Mark wrote, I'd like to add that you can also
set "first_notification_delay" for both hosts and services. That will make
the services and hosts appear red and critical in the UI, but it will delay
notifications for AT LEAST the specified amount of time (multiplied by
interval_length, so usually it means minutes).

I've stressed AT LEAST, since first_notification_delay requires that a
check is run in order to trigger the notification, so the delay could
sometimes be greater than what you specify. Some people are a bit freaked
out by that, so you'd best know it before you start using it.
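
To illustrate the AT LEAST point, here is a simplified model of the behaviour described above, not the actual Nagios scheduling code. It assumes the notification can only go out when a check result arrives at or after the delay has elapsed; the function name and the sample check times are hypothetical.

```python
def first_notification_time(problem_start, check_times,
                            first_notification_delay, interval_length=60):
    """Return the time (in seconds) of the first check that can notify.

    problem_start and check_times are in seconds; first_notification_delay
    is in Nagios time units of interval_length seconds each.
    Returns None if no check falls at or after the end of the delay.
    """
    earliest = problem_start + first_notification_delay * interval_length
    for t in sorted(check_times):
        # A notification needs a check result, so the first check at or
        # after the end of the delay is the one that triggers it.
        if t >= earliest:
            return t
    return None


# Hypothetical case: problem starts at t=0, checks run every 300 seconds,
# and first_notification_delay is 7 (minutes with the default interval_length).
checks = [0, 300, 600, 900]
print(first_notification_time(0, checks, 7))
```

In this model the delay asked for is 420 seconds, but the notification waits for the check at 600 seconds, which is the kind of overshoot the paragraph above warns about.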

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting
any issue. 
::: Messages without supporting info will risk being sent to /dev/null
