HOST DOWN notification not getting resent

Quanah Gibson-Mount quanah at stanford.edu
Thu Aug 26 00:06:50 CEST 2004



--On Wednesday, August 25, 2004 10:47 AM +0200 Andreas Ericsson <ae at op5.se> 
wrote:

> Ok. To the bottom of this then.
> 1. Services will have nothing to do with it (except for failing, but you
> knew that already), so cut the line-noise.

Err, I was trying to cut the line noise by pointing out that services 
worked fine.  Should I instead remove all mention of services, and assume 
that everyone on the list will magically realize that my services are 
working just fine, and it is only the host down notifications that are a 
problem?

> 2. Is the host down or unreachable?

Yes.  Poweroff is a very nice command.

> 3. Are you positive the host hasn't gone to flapping state? Nagios 1.x
> doesn't notify for this. Nagios 2.0 has an option to do so.

Yes, absolutely positive.  I can run a ping from another window that 
consistently shows the host never returning anything.

> 4. You're sure you haven't set notification_interval to 0 in the host
> object definition (or anywhere else, for that matter)?

Yes.  Especially since it is quite happy to send the *first* host down 
alert, just not any following alerts.

> 5. You're sure nothing is wrong with the way notifications are sent?

Yes, because all service notifications are sent correctly, for hours on 
end, if a host is up and its services have problems.

> 6. Have you tried running Nagios as a foreground process while producing
> errors like this in the configuration?

I'm not quite sure what you mean here.  We always check Nagios through "-v" 
before we apply our configuration, and our script that applies our 
configuration won't let you install a bad configuration.  So I'm not sure 
what "errors like this in the configuration" you are referring to.

> 7. Have you tried increasing the notification interval? I'm not sure what
> happens if Nagios 'misses' a scheduled notification, but it might just
> happily skip it and move on.

Our normal notification interval is 30 minutes for hosts.  It doesn't work 
at that setting either.

> 8. What's the normal load on the machine you're running Nagios at?

3-4 in the Solaris world.  Note again that all service checks work just 
fine at this load level.

> 9. Are you using the default notification commands, or have you written
> your own ones? If so, do they adhere to the NOTIFICATIONNUMBER macro?

I'm using the default notification command that came with Nagios.

> 10. Do you have a spamfilter in place? If so, remove it.

No, I do not.

> 11. Add an extra notification command that looks like so:
> define command{
> command_name notification_stamp
> command_line date "+%Y:%m:%d %H:%M:%S" >>
> /home/quanah/Notifications.Timestamp
> }

I'll be happy to do that (to a different directory though).

> (mind the new-line) and make this the notification of choice for a
> lab-host you're trying. Watch the file grow if the host is down.

I'll watch and see 'if' the file grows when the host is down. ;)
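
For anyone repeating this test, here is a minimal Python sketch for checking whether the timestamp file really grows at the configured interval. It assumes the file contains one line per notification in the "+%Y:%m:%d %H:%M:%S" format from the suggested command. The function name notification_gaps is hypothetical and not part of Nagios.

```python
from datetime import datetime

# Assumption: each line was written by `date "+%Y:%m:%d %H:%M:%S"`,
# as in the notification_stamp command suggested above.
STAMP_FORMAT = "%Y:%m:%d %H:%M:%S"


def notification_gaps(lines):
    """Return the gaps, in seconds, between consecutive timestamp lines.

    Blank lines are ignored. An empty list means at most one
    notification was recorded.
    """
    stamps = [
        datetime.strptime(line.strip(), STAMP_FORMAT)
        for line in lines
        if line.strip()
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(stamps, stamps[1:])
    ]
```

Pass it the lines of the timestamp file, for example through an open file handle. With a 30 minute notification interval, a host that stays down should give gaps near 1800 seconds. An empty list after the first stamp would match the behavior described in this thread.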

> 12. If all of the above fails, try it again.

Um yeah, we've been dealing with this for about 8 months now.

> 13. If you're still out of luck then set up the simplest possible
> configuration (one host that you can bring up and down at wish), and make
> sure several notifications go out before you move to more advanced
> configuration. Make a host-template that you KNOW works with this, and
> use it for all hosts you want to resend notifications with.

I'll do that as soon as I have a secondary host to fiddle with the 
configurations on.  I can't just take out our production monitoring 
service. ;)

> 14. Use the default nagios.cfg-file, just to be on the safe side.

I'll combine that with 13.

> 15. If problems persist, debug your mail-spooler.

My mail spooler is just fine.  It sends out hundreds of messages from 
Nagios every day.

> 16. If problems still persist, debug any relayhosts the mail passes
> through.

That would assume that all alerts were problematic.  They aren't.  There is 
only one type of alert that is problematic.  And I always get *one* of 
those alerts, no more and no less.
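
One way to confirm the "exactly one alert" claim independently of mail delivery is to count the host notifications that Nagios itself logs. The Python sketch below assumes log lines of the form "[epoch] HOST NOTIFICATION: contact;host;state;command;output". That layout is an assumption and should be checked against the actual nagios.log. The function name count_host_notifications is hypothetical.

```python
import re
from collections import Counter

# Assumed log line shape (verify against your own nagios.log):
#   [1093392000] HOST NOTIFICATION: contact;host;state;command;output
HOST_NOTIFICATION = re.compile(
    r"^\[(\d+)\] HOST NOTIFICATION: ([^;]*);([^;]*);([^;]*);"
)


def count_host_notifications(lines):
    """Count logged host notifications per (host, state) pair."""
    counts = Counter()
    for line in lines:
        match = HOST_NOTIFICATION.match(line)
        if match:
            _epoch, _contact, host, state = match.groups()
            counts[(host, state)] += 1
    return counts
```

A count of 1 for a host that stayed down across several notification intervals would show that Nagios never attempted the repeat notifications, which would rule out the mail path.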

> 17. If the problems still persist, buy 3 hours of support from someone,
> and send them your configuration in a gzipped tarball.

No thanks.

Before even implementing Nagios here at Stanford, I read through the 
configuration files & played with the setup for a few weeks.  Then we 
implemented it, and pushed it out.  The configuration pieces are rather 
simple, and the documentation was quite thorough.  I'm not some 2-bit hack 
who has problems understanding command prompts, etc. I've been 
administering UNIX-based systems & applications for over 10 years.  I've 
yet to see anyone be able to find anything in our configuration that 
explains Nagios' behavior.  Personally, I think it is a bug in Nagios 
running under Solaris, and I've yet to see anything that contradicts that 
assumption at all.  We will be moving our Nagios service onto Debian soon, 
and I'm most curious to see if the problem disappears at that time.  If it 
does, then at least I'll be able to point at the root cause.

--Quanah

--
Quanah Gibson-Mount
Principal Software Developer
ITSS/Shared Services
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html

