Nagios stopped checking most of my services!

Martin Melin mmelin at gmail.com
Tue Nov 3 08:15:30 CET 2009


This is a confirmed bug in 3.2.0. We Europeans noticed this last week when
we switched from DST.

Start of thread on the problem:
http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg29695.html

There are a bunch of suggestions there on how to get your checks scheduled
sanely again without losing acknowledgements etc.
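
For anyone with too many services to click through one by one, here is a minimal sketch of scripting the reschedule through the external command file. It is not taken from the linked thread. It assumes external commands are enabled (check_external_commands=1) and that the command file sits at the default path for a source install. The host names and the five-minute spread window are placeholders. SCHEDULE_FORCED_HOST_CHECK and SCHEDULE_FORCED_HOST_SVC_CHECKS are standard Nagios external commands.

```python
import time

# Assumed path: the default for a source install; check command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_reschedule_commands(hosts, now, spread_seconds=300):
    """Return external-command lines that force a check of each host and its services.

    Check times are spread evenly across spread_seconds so that thousands
    of checks are not all scheduled for the same second.
    """
    commands = []
    count = max(len(hosts), 1)
    for index, host in enumerate(hosts):
        check_time = now + (index * spread_seconds) // count
        commands.append(f"[{now}] SCHEDULE_FORCED_HOST_CHECK;{host};{check_time}\n")
        commands.append(f"[{now}] SCHEDULE_FORCED_HOST_SVC_CHECKS;{host};{check_time}\n")
    return commands


def submit(commands, command_file=COMMAND_FILE):
    """Write the commands to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(commands)


if __name__ == "__main__":
    # Hypothetical host names; use the host_name values from your own config.
    for line in build_reschedule_commands(["web01", "db01"], int(time.time())):
        print(line, end="")
```

Running the script only prints the command lines. Call submit() with the same list once the output looks right. Forcing checks this way leaves acknowledgements and the rest of the retained state alone, which deleting the retention file does not.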

Regards,
Martin Melin

On Tue, Nov 3, 2009 at 5:03 AM, Frost, Mark {PBG} <mark.frost1 at pepsi.com> wrote:

>  I guess I'm another "me too".  We use Nagios 3.0.6, but I had just set up
> an upgrade to 3.2.0.  From what I could see, my distributed nodes had been
> sending data just fine for 45 minutes or so.  When I double-checked the
> performance graphs just before retiring for the night I saw no data coming
> in.  When I traced this back to the distributed nodes, their scheduling
> queue showed no checks scheduled until well into the next day.  We have some
> checks that run as frequently as every minute.
>
>
>
> I assumed this was a weird bug with 3.2.0, panicked and went back to 3.0.6
> a little after midnight and things have been fine ever since.  I was going
> to spend more time observing 3.2.0 in a more contained environment to see if
> this was normal behavior.  My timing (checks stopping around 11pm Sunday
> night) sounds the same so perhaps it's not just my imagination.
>
>
>
> One thing that bothered me a bit was that I didn't see messages in the
> central servers indicating that it was marking service checks as stale and
> checking automatically.  I saw no stale messages in the log and it should
> have been well past the freshness thresholds of most checks.  As I say, it
> was late and I decided to roll back before I investigated.
>
>
>
> I've got thousands of service checks so forcing rescheduling wouldn't work
> for me.
>
>
>
> Mark
>
>
>
> *From:* Les Fenison [mailto:les at deltatechnicalservices.com]
> *Sent:* Monday, November 02, 2009 9:47 PM
> *To:* Andy Howell
> *Cc:* nagios-users at lists.sourceforge.net
> *Subject:* Re: [Nagios-users] Nagios stopped checking most of my services!
>
>
>
> Well, so far 3 of us with the same problem on the same day.  I have to
> believe it is daylight savings time related.
>
> My fix is to go click on each service one by one and reschedule.  Then they
> start checking normally again.
>
> I wonder if there is any way to force an automatic reschedule of all
> services and hosts for next year when this happens again?
>
> Andy Howell wrote:
>
> Les Fenison wrote:
>
>  I had nagios working great.  Checking 6 hosts and about 85 services.
> Then suddenly, all services on all hosts except one stopped checking.  The
> next scheduled check is about 24 hours from the last check.  I had been
> checking every 5 minutes.
>
> Restarting Nagios didn't help.  I am using a GUI, NagioSQL, to edit my
> configuration files so I suspect it did something to me but I have no clue
> where to look except where I have already looked.
>
> What can cause nagios to just stop checking everything like that or to
> randomly switch to every 24 hours rather than the configured every 5
> minutes?
>
> I am having to manually do force checks to get it to check.
>
> Here are some things I have checked...
>
> Hosts  check_interval is 5, retry_interval is 1
> Services  check_interval is 10, retry_interval is 2
>
> So where could Nagios be getting the idea that it is supposed to be every 24
> hours?
>
>
> I had the same experience yesterday. Maybe daylight savings related? At
> about 11pm, all the services were scheduled for 11pm the following day. I
> figured it was something I did wrong. I noticed that "next_check" time in
> /var/log/nagios/retention.dat was wrong. I renamed the file and restarted
> nagios. It worked fine after that.
>
> I am using version 3.2.
>
> Regards,
>
>     Andy
>
>
>
> --
>  ------------------------------
>
> Les Fenison
> Delta Technical Services
> www.DeltaTechnicalServices.com
> les at DeltaTechnicalServices.com
> 503-766-0076
>
>
> ------------------------------------------------------------------------------
> Come build with us! The BlackBerry(R) Developer Conference in SF, CA
> is the only developer event you need to attend this year. Jumpstart your
> developing skills, take BlackBerry mobile applications to market and stay
> ahead of the curve. Join us from November 9 - 12, 2009. Register now!
> http://p.sf.net/sfu/devconference
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>
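
Andy's observation in the quoted message, that "next_check" in retention.dat was wrong, can be checked without eyeballing the file. The sketch below assumes the usual retention file layout: blocks opened by a line such as "service {", one key=value pair per line, and a closing "}". The one-hour threshold, the timestamp and the sample data are made up for illustration. Pick a threshold comfortably above your longest check_interval (5 and 10 minutes in the configuration quoted above, with the default interval_length of 60 seconds).

```python
def find_late_checks(lines, now, max_delay_seconds):
    """Return host/service entries whose next_check lies too far in the future.

    Each result is (block_type, host_name, service_description, next_check).
    service_description is an empty string for host blocks.
    """
    late = []
    block_type = None
    fields = {}
    for raw in lines:
        line = raw.strip()
        if line.endswith("{"):
            # Start of a block such as "host {" or "service {".
            block_type = line[:-1].strip()
            fields = {}
        elif line == "}":
            if block_type in ("host", "service"):
                next_check = int(fields.get("next_check", "0"))
                if next_check - now > max_delay_seconds:
                    late.append((
                        block_type,
                        fields.get("host_name", ""),
                        fields.get("service_description", ""),
                        next_check,
                    ))
            block_type = None
        elif block_type is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value
    return late


# Made-up sample in the assumed layout; a real run would read the lines of
# the file named by state_retention_file in nagios.cfg instead.
SAMPLE = """\
service {
\thost_name=web01
\tservice_description=HTTP
\tnext_check=1257289200
\t}
host {
\thost_name=web01
\tnext_check=1257203100
\t}
"""

if __name__ == "__main__":
    now = 1257202800  # hypothetical "current" time for the sample
    for entry in find_late_checks(SAMPLE.splitlines(), now, 3600):
        print(entry)
```

This only reports the affected entries. It does not modify the retention file, which Nagios rewrites on shutdown and at its regular retention updates.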