2.0 stable stops checking

Terry td3201 at gmail.com
Fri Mar 17 21:09:18 CET 2006


Yes, I have tried that.  The nagios.log is not being updated with
everything.  There are a ton of INITIAL STATE messages in there from
startup, but after that only 2 service checks are logged, while my
status page says I have 29 OK services with 424 pending.  Huge
inconsistencies.

After all is said and done, all I ever see in the process list are
check_ping processes, and ONLY for hosts in 1 hostgroup.  Weird.
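
One rough way to put numbers on the gap between nagios.log and the status
page is sketched below. It assumes only what the quoted log line further
down shows, namely that each nagios.log entry starts with a bracketed Unix
timestamp. The function name, the "INITIAL ... STATE" substring match, and
any log path you pass in are assumptions for illustration, not part of
Nagios itself.

```python
import re
import time

# Entries are assumed to look like "[1139362901] Caught SIGSEGV, shutting down... "
LINE_RE = re.compile(r"^\[(\d+)\]\s*(.*)$")


def summarize_log(lines, now=None):
    """Return (entry_count, initial_state_count, seconds_since_last_entry).

    seconds_since_last_entry is None when no timestamped entry was found.
    """
    now = time.time() if now is None else now
    total = 0
    initial = 0
    last_ts = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a timestamped entry
        total += 1
        ts = int(match.group(1))
        if last_ts is None or ts > last_ts:
            last_ts = ts
        message = match.group(2)
        # Assumption: startup entries contain both words "INITIAL" and "STATE".
        if "INITIAL" in message and "STATE" in message:
            initial += 1
    age = None if last_ts is None else now - last_ts
    return total, initial, age
```

Feeding it the open log file (for example `summarize_log(open(path))`, with
`path` pointing at wherever your nagios.log lives) gives the number of
entries, how many of them are startup messages, and how long ago the last
entry was written. A large age while the daemon is still running would
match the symptom described above.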


On 3/17/06, Eli Stair <estair at ilm.com> wrote:
>
> Are you in a position to stop services for a minute and check starting
> up again with the retention.dat file moved out of the way?  If you're
> hesitant, you may want to start up another instance of Nagios in parallel
> to test this.  That's the sane approach, but I've proven to myself well
> enough that this is always the case (in _my_ _current_ instance), so I
> just do it on the production system when I catch it.
>
> I'm really curious to find out whether this is the same issue, and
> whether the same resolution works for you as well.
>
> /eli
>
> Terry wrote:
> > No, not all checks.  I see check_ping processes still firing up:
> >
> > [root@plaut08 etc]# ps xauwwww -H| grep nagios  | grep -v grep
> > nagios   26676 11.0  0.1 28620 3852 ?        Ssl  13:35   0:11 /usr/bin/nagios -d /etc/nagios/nagios.cfg
> > nagios   26814  0.0  0.1 28624 3852 ?        S    13:36   0:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
> > nagios   26815  0.0  0.0  4684  640 ?        S    13:36   0:00 /usr/lib/nagios/plugins/check_ping -H 172.28.7.59 -w 3000.0,80%% -c 5000.0,100%% -p 15 -t 30
> > nagios   26816  0.0  0.0  2580  528 ?        S    13:36   0:00 /bin/ping -n -U -w 90 -c 15 172.28.7.59
> >
> >
> > I am seeing the same thing as you: only certain hosts/hostgroups are
> > being checked, and then all of a sudden everything stops EXCEPT the
> > pings shown above, and those checks are not being logged in nagios.log.
> > Very weird.
> >
> > On 3/17/06, Eli Stair <estair at ilm.com> wrote:
> >
> >>So you're seeing the scenario where nagios stops _all_ checks
> >>altogether?  I've had this happen when the nagios parent process dies
> >>and logs to nagios.log to this effect: "[1139362901] Caught SIGSEGV,
> >>shutting down... ".  I was getting these very frequently when I went
> >>above some apparent host/service threshold (they went away when I removed
> >>about 128 nodes at one point recently).  In these cases the CGIs still
> >>respond for some reason, which seemed inappropriate...
> >>
> >>I've also seen the same symptom, but without a well-advertised nagios
> >>failure, where the process is still present in memory but checks aren't
> >>executed and the CGIs are functional.
> >>
> >>The third related (and my current bane...) issue is where MOST all
> >>checks occur, but some (sometimes large) groups of unrelated actions no
> >>longer occur.  Host/service checks as a whole seem to be working, but
> >>I'll notice that I haven't gotten an alert for something that failed,
> >>and then see that whole class of service checks on one hostgroup aren't
> >>running anymore... and then start to see the same issue with other
> >>checks/actions as well.
> >>
> >>I'd sure love to just have nagios start working again, as I'm strongly
> >>against having to write an external framework that checks various parts
> >>of Nagios and alerts me when it's broken!  On the other hand, I've always
> >>kept up to date on other OS monitor/alert frameworks and still nothing is
> >>as extensible as Nagios is (yet).
> >>
> >>/eli
> >>
> >>
> >>Terry wrote:
> >>
> >>>In just looking at the logs, the status.log is being continuously
> >>>updated as normal but when checks stop, the nagios.log stops gathering
> >>>entries as well.
> >>>
> >>>On 3/17/06, Eli Stair <estair at ilm.com> wrote:
> >>>
> >>>
> >>>>I've been seeing this continuously in 2.0beta/rc/releases.  For details
> >>>>on my situation/posts, check the devel/users archives; I'm curious if any
> >>>>similarities exist.  I haven't gotten acknowledgement/resolution on this
> >>>>either; the only thing I've determined is that (in my case) stopping
> >>>>nagios and restarting with the retention file zeroed resolves the issue
> >>>>100%.
> >>>>
> >>>>Having an extra nagios process running can definitely cause this and
> >>>>other issues.  In my case that's never been present and thus is not
> >>>>the cause...
> >>>>
> >>>>/eli
> >>>>
> >>>>Terry wrote:
> >>>>
> >>>>
> >>>>>I am seeing this as well.  I have services that do not get checked
> >>>>>when they are scheduled:
> >>>>>
> >>>>>Last Check Type:      ACTIVE
> >>>>>Last Check Time:      03-17-2006 08:50:47
> >>>>>Status Data Age:      0d 1h 37m 51s
> >>>>>Next Scheduled Active Check:          03-17-2006 10:09:01
> >>>>>Latency:      342.408 seconds
> >>>>>Check Duration:       10.015 seconds
> >>>>>Last State Change:    03-16-2006 11:55:02
> >>>>>Current State Duration:       0d 22h 33m 36s
> >>>>>
> >>>>>It is currently 10:29 and it still hasn't been checked.  This is one of
> >>>>>many examples.
> >>>>>
> >>>>>On 3/15/06, Matthias Eble
> >>>>><matthias.eble at mailing.kaufland-informationssysteme.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Hi all!
> >>>>>>
> >>>>>>We are experiencing occasional problems with nagios 2.0 stable. The
> >>>>>>main process was reloaded due to configuration changes yesterday (Mar
> >>>>>>14th). Since then ps -ef looks like this:
> >>>>>>
> >>>>>>nagios    1078     1 12 Mar09 ?        16:49:43 /opt/nagios/bin/nagios
> >>>>>>-d /opt/nagios/etc/nagios.cfg
> >>>>>>nagios    9431  1078  0 Mar14 ?        00:00:00 [nagios] <defunct>
> >>>>>>
> >>>>>>and nagios stopped checking. Does anyone have an idea what could have
> >>>>>>happened? The nagios.log and status.dat files have not been updated since then.
> >>>>>>
> >>>>>>thanks
> >>>>>>matthias
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>-------------------------------------------------------
> >>>>>>This SF.Net email is sponsored by xPML, a groundbreaking scripting language
> >>>>>>that extends applications into web and mobile media. Attend the live webcast
> >>>>>>and join the prime developer group breaking into this new coding territory!
> >>>>>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
> >>>>>>_______________________________________________
> >>>>>>Nagios-users mailing list
> >>>>>>Nagios-users at lists.sourceforge.net
> >>>>>>https://lists.sourceforge.net/lists/listinfo/nagios-users
> >>>>>>::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> >>>>>>::: Messages without supporting info will risk being sent to /dev/null
> >>>>>>

