How to troubleshoot when not receiving alerts?

Marc Powell marc at ena.com
Fri Jul 25 19:00:15 CEST 2008


On Jul 25, 2008, at 10:12 AM, John Oliver wrote:

> On Thu, Jul 24, 2008 at 11:12:55PM -0500, Marc Powell wrote:
>>

> I just checked nagios.cfg and:
>
> interval_length=1

All your intervals are in seconds then. The default is 60.
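For reference, here's how interval_length interacts with a service's intervals (values below are illustrative, not taken from your config):

```
# nagios.cfg -- illustrative values
# interval_length is the unit, in seconds, that every *_interval value
# is multiplied by. The shipped default is 60, i.e. intervals in minutes.
interval_length=60

# With interval_length=60, a service with check_interval=5 is checked
# every 5 * 60 = 300 seconds. With your interval_length=1, the same
# check_interval=5 means every 5 seconds.
```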

>>> thought I had the errors fixed... the last email I got said  
>>> RECOVERED
>>> (even though I should be getting CRITICAL alerts, as there is 1%  
>>> disk
>>> space left).  I changed the notification_interval, and never saw
>>> another
>>> email.
>>
>> Does the web interface show the status as CRITICAL? If you received a
>> recovery notification the service was considered to be OK. What did
>> you fix?
>
> No.  The web interface is really confusing for this server:
>
> ftp	UP      N/A     486d 17h 50m 1s
>
> It has not been up for 486 days.  And this is the one device that has

You should verify the command{} definition behind whatever that UP check  
is. It's a check that you or your predecessor created, not a  
'standard' one. If the host hasn't actually been up for 486 days, then  
you're not checking what you think you're checking.
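For comparison, a typical host check chain looks like the following (names and address here are guesses for illustration, not your actual config). If the command the host points at always succeeds, the host will show UP regardless of reality:

```
# Illustrative host definition -- compare against what's actually defined.
define host{
    host_name       ftp
    address         192.0.2.10           ; example address
    check_command   check-host-alive     ; typically a ping-based check
    }

# The command it references, as shipped in the sample commands.cfg:
define command{
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
    }
```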

> N/A for last check.  It's green and "UP".  But that doesn't change the
> fact that nrpe reports 1% of disk space left, and that the nagios  
> server
> can see that at least when I manually run the command.

Correct, they'd be completely unrelated.

> I'm starting to read about is_volatile, but I'm not really  
> understanding
> it.  One example is "things that automatically reset themselves to an
> "OK" state each time they are checked"  That certainly isn't the case
> with a disk space check.

Correct. Most services are not volatile. An example would be an SNMP  
trap. For every trap you receive, you want to send a notification  
regardless of the status of the previous trap. A volatile service  
sends out a notification for *every* non-OK check result for that  
service.

> command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20 -c 10 -p
> /dev/mapper/VolGroup00-LogVol00

Warn if less than 20MB is free, Critical if less than 10MB is free --  
that's the common mistake I referenced.
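A corrected nrpe.cfg line on the ftp host would look like this (percent thresholds instead of raw megabytes; the path is kept from the original):

```
# nrpe.cfg on the monitored host:
# warn below 20% free, critical below 10% free
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/VolGroup00-LogVol00
```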

> [root at cerberus ~]# su nagios -c "/usr/lib/nagios/plugins/check_nrpe -H
> ftp -c check_disk"
> DISK OK - free space: / 782 MB (0% inode=99%);|
> /=134653MB;142786;142796;0;142806

It's OK according to the criteria you've defined; you've got another  
762M to go before warning ;-). 'check_disk --help' might be a good  
read. You want to add a '%' to those numbers.
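A hypothetical back-of-the-envelope check (not Nagios code) showing why the plugin reports OK with '-w 20 -c 10' but would go CRITICAL with '-w 20% -c 10%'. The numbers come from the check output above:

```shell
free_mb=782      # "free space: / 782 MB" from the plugin output
free_pct=0       # the "(0%)" in the same output

# Without '%', thresholds are megabytes free: warn < 20 MB, crit < 10 MB.
if [ "$free_mb" -lt 10 ]; then echo "CRITICAL"
elif [ "$free_mb" -lt 20 ]; then echo "WARNING"
else echo "OK"; fi                      # prints OK

# With '%', thresholds are percent free: warn < 20%, crit < 10%.
if [ "$free_pct" -lt 10 ]; then echo "CRITICAL"
elif [ "$free_pct" -lt 20 ]; then echo "WARNING"
else echo "OK"; fi                      # prints CRITICAL
```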

>> It seems to me you're not receiving notifications because hard state
>> changes are not occurring. This is generally desired behavior.
>
> That doesn't really make sense to me.  I won't be alerted until the
> problem is fixed?  Or gets worse?

You'll be alerted when the service changes state by default: OK ->  
Warning, OK -> Critical, Warning -> Critical, Warning -> OK, Critical ->  
OK. With a notification interval of 180, you should be re-notified  
every 180 seconds _but_ only if the service is in a non-OK state.  
You're not in a non-OK state, so your next notification will be when a  
state change to Warning or Critical occurs.

http://nagios.sourceforge.net/docs/3_0/notifications.html

> Here's what I'd like to wind up with... if available disk space drops
> below a certain point, I'd like to have an alert go out maybe once per
> day.  If it drops past another point, into critical territory, I'd  
> like

You should have enough information to fix the disk check now. For the  
notifications, adjust notification_interval to be 86400 (1 day in  
seconds).
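Since your interval_length is 1 (seconds), that would look roughly like this (an illustrative definition, not your actual config; other required directives omitted):

```
# Illustrative service definition -- with interval_length=1,
# all *_interval values below are in seconds.
define service{
    host_name               ftp
    service_description     check_disk
    check_command           check_nrpe!check_disk
    notification_interval   86400   ; re-notify once per day while non-OK
    ; ...other required directives (contacts, check intervals, etc.)...
    }
```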

> alerts to be sent out more frequently.  But, whatever the interval is,

This is not possible AFAIK; notification_interval is a single fixed  
value per service. One option is a shorter notification_interval  
combined with Escalations. Another would be to build that kind of  
logic into your notification script.
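A sketch of the escalation approach (contact group and thresholds are made up for illustration): with a short base notification_interval, an escalation can change how often or to whom notifications go after the Nth one:

```
# Illustrative escalation -- names and numbers are hypothetical.
define serviceescalation{
    host_name               ftp
    service_description     check_disk
    first_notification      3       ; takes effect at the 3rd notification
    last_notification       0       ; ...and stays in effect thereafter
    notification_interval   3600    ; hourly, given interval_length=1
    contact_groups          admins
    }
```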

> nagios should be alerting each time it sees low disk space.  If it

Every check? If that's what you want then setting is_volatile would do  
it.
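If every non-OK check result really should generate a notification, the service would be marked volatile, roughly like this (illustrative; volatile services are typically paired with max_check_attempts 1 so they go hard immediately):

```
# Illustrative -- a volatile service notifies on *every* non-OK result.
define service{
    host_name               ftp
    service_description     check_disk
    check_command           check_nrpe!check_disk
    is_volatile             1
    max_check_attempts      1
    }
```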

> alerts once, and then assumes that it never has to alert again unless
> the problem gets fixed and then reappears, it's never going to get
> fixed.  Once I have alerting working this way, I'll point the emails  
> at

That sounds like a people issue ;) Normally, that's the behavior but  
Escalations can help force the people issue.

--
Marc

