2.0 stable stops checking

Eli Stair estair at ilm.com
Fri Mar 17 21:00:28 CET 2006


Are you in a position to stop services for a minute and try starting up
again with the retention.dat file moved out of the way?  If you're
hesitant, you may want to start another instance of Nagios in parallel
for testing.  That's the sane approach, but I've proven to myself enough
times that this always fixes it (in _my_ _current_ instance) that I just
do it on the production system when I catch it.

I'm really curious to find out whether this is the exact same issue, and
whether the same resolution works for you as well.
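A minimal sketch of the "move it out of the way" step, assuming Nagios has already been stopped. Nagios typically writes its retention data on shutdown when retain_state_information is enabled, so moving the file before stopping would just let it be recreated. The function name and the timestamped backup name are my own choices, not anything Nagios provides; point it at whatever state_retention_file is set to in your nagios.cfg.

```python
import shutil
import time
from pathlib import Path
from typing import Optional, Union


def move_retention_aside(retention_path: Union[str, Path],
                         stamp: Optional[str] = None) -> Optional[Path]:
    """Rename the retention file so Nagios next starts without retained state.

    Call this only after Nagios has stopped: the daemon typically rewrites
    the retention file on shutdown, which would recreate it.
    Returns the backup path, or None if there was no retention file.
    """
    src = Path(retention_path)
    if not src.exists():
        return None
    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")
    # Keep the old state under a timestamped name so it can be diffed
    # or restored later instead of being deleted outright.
    dst = src.with_name(f"{src.name}.{stamp}")
    shutil.move(str(src), str(dst))
    return dst
```

Keeping the old file around (rather than zeroing it) means that if a clean start does fix the problem, the bad retention state is still available to compare against.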

/eli

Terry wrote:
> No, not all checks.  I see check_ping processes still firing up:
> 
> [root@plaut08 etc]# ps xauwwww -H | grep nagios | grep -v grep
> nagios   26676 11.0  0.1 28620 3852 ?        Ssl  13:35   0:11  /usr/bin/nagios -d /etc/nagios/nagios.cfg
> nagios   26814  0.0  0.1 28624 3852 ?        S    13:36   0:00    /usr/bin/nagios -d /etc/nagios/nagios.cfg
> nagios   26815  0.0  0.0  4684  640 ?        S    13:36   0:00      /usr/lib/nagios/plugins/check_ping -H 172.28.7.59 -w 3000.0,80%% -c 5000.0,100%% -p 15 -t 30
> nagios   26816  0.0  0.0  2580  528 ?        S    13:36   0:00        /bin/ping -n -U -w 90 -c 15 172.28.7.59
> 
> 
> I am seeing the same thing you are: only certain hosts/hostgroups
> are being checked, and then all of a sudden everything stops except
> the pings shown above, and even those checks are not being recorded
> in nagios.log.  Very weird.
> 
> On 3/17/06, Eli Stair <estair at ilm.com> wrote:
> 
>>So you're seeing the scenario where nagios stops _all_ checks
>>altogether?  I've had this happen when the nagios parent process dies
>>and logs something to this effect in nagios.log: "[1139362901] Caught
>>SIGSEGV, shutting down... ".  I was getting these very frequently when I
>>went above some apparent host/service threshold (they went away when I
>>removed about 128 nodes at one point recently).  In these cases the CGIs
>>still respond for some reason, which seemed inappropriate...
>>
>>I've also seen the same symptom without a well-advertised nagios
>>failure: the process is still present in memory and the CGIs are
>>functional, but checks aren't executed.
>>
>>The third related (and my current bane...) issue is where almost all
>>checks occur, but some (sometimes large) groups of unrelated actions no
>>longer occur.  Host/service checks as a whole seem to be working, but
>>I'll notice that I haven't gotten an alert for something that failed,
>>then see that a whole class of service checks on one hostgroup isn't
>>running anymore... and then start to see the same issue with other
>>checks/actions as well.
>>
>>I'd sure love to just have nagios start working again, as I'm strongly
>>against having to write an external framework that checks various parts
>>of Nagios and alerts me when it's broken!  That said, I've always kept
>>up to date on other OS monitoring/alerting frameworks, and still nothing
>>is as extensible as Nagios (yet).
>>
>>/eli
>>
>>
>>Terry wrote:
>>
>>>Just looking at the logs, status.log is being updated continuously as
>>>normal, but when checks stop, nagios.log stops gathering entries as
>>>well.
>>>
>>>On 3/17/06, Eli Stair <estair at ilm.com> wrote:
>>>
>>>
>>>>I've been seeing this continuously in the 2.0 beta/RC/release versions.
>>>>For details on my situation, check my posts in the devel/users archives;
>>>>I'm curious whether any similarities exist.  I haven't gotten any
>>>>acknowledgement/resolution on this either.  The only thing I've
>>>>determined is that (in my case) stopping nagios and restarting with the
>>>>retention file zeroed resolves the issue 100%.
>>>>
>>>>Having an extra nagios process running can definitely cause this and
>>>>other issues.  In my case that has never been present, so it isn't the
>>>>cause...
>>>>
>>>>/eli
>>>>
>>>>Terry wrote:
>>>>
>>>>
>>>>>I am seeing this as well.  I have services that do not get checked
>>>>>when they are scheduled:
>>>>>
>>>>>Last Check Type:      ACTIVE
>>>>>Last Check Time:      03-17-2006 08:50:47
>>>>>Status Data Age:      0d 1h 37m 51s
>>>>>Next Scheduled Active Check:          03-17-2006 10:09:01
>>>>>Latency:      342.408 seconds
>>>>>Check Duration:       10.015 seconds
>>>>>Last State Change:    03-16-2006 11:55:02
>>>>>Current State Duration:       0d 22h 33m 36s
>>>>>
>>>>>It is currently 10:29 and it still hasn't been checked.  This is one of
>>>>>many examples.
>>>>>
>>>>>On 3/15/06, Matthias Eble
>>>>><matthias.eble at mailing.kaufland-informationssysteme.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Hi all!
>>>>>>
>>>>>>We are experiencing occasional problems with nagios 2.0 stable. The
>>>>>>main process was reloaded due to configuration changes yesterday (Mar
>>>>>>14th). Since then, ps -ef looks like this:
>>>>>>
>>>>>>nagios    1078     1 12 Mar09 ?        16:49:43 /opt/nagios/bin/nagios -d /opt/nagios/etc/nagios.cfg
>>>>>>nagios    9431  1078  0 Mar14 ?        00:00:00 [nagios] <defunct>
>>>>>>
>>>>>>and nagios has stopped checking. Does anyone have an idea what could
>>>>>>have happened? The nagios.log and status.dat files have not been
>>>>>>updated since then.
>>>>>>
>>>>>>thanks
>>>>>>matthias
>>>>>>
>>>>>>
>>>>>>
>>>>>>-------------------------------------------------------
>>>>>>This SF.Net email is sponsored by xPML, a groundbreaking scripting language
>>>>>>that extends applications into web and mobile media. Attend the live webcast
>>>>>>and join the prime developer group breaking into this new coding territory!
>>>>>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
>>>>>>_______________________________________________
>>>>>>Nagios-users mailing list
>>>>>>Nagios-users at lists.sourceforge.net
>>>>>>https://lists.sourceforge.net/lists/listinfo/nagios-users
>>>>>>::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
>>>>>>::: Messages without supporting info will risk being sent to /dev/null
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
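On the overdue checks in Terry's example (next check scheduled for 10:09:01, still not run at 10:29), one way to get warned about this without writing a whole external framework is to scan status.dat for active checks whose next_check time is well in the past. This is only a sketch, not part of Nagios. It assumes the Nagios 2.x status.dat layout of "service { key=value ... }" blocks with next_check in epoch seconds (as far as I know, 3.x renames the block to servicestatus, so both are accepted); the function names and the 300-second grace period are arbitrary. It also catches Matthias's case, where status.dat stops being written entirely, because the frozen next_check values fall further behind the current time.

```python
import re
import time
from typing import Dict, Iterator, List, Optional, Tuple

# Matches block openers such as "service {" or "servicestatus {".
BLOCK_START = re.compile(r"^\s*(\w+)\s*\{\s*$")


def parse_status_blocks(text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (block_type, fields) for each "name { key=value ... }" block."""
    kind: Optional[str] = None
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if kind is None:
            match = BLOCK_START.match(line)
            if match:
                kind, fields = match.group(1), {}
        elif stripped == "}":
            yield kind, fields
            kind = None
        elif "=" in stripped:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = stripped.partition("=")
            fields[key] = value


def overdue_services(text: str, now: Optional[int] = None,
                     grace: int = 300) -> List[Tuple[str, str, int]]:
    """Return (host, service, seconds_overdue) for active checks past due."""
    if now is None:
        now = int(time.time())
    late = []
    for kind, f in parse_status_blocks(text):
        # Assumed block names: "service" (2.x) or "servicestatus" (3.x).
        if kind not in ("service", "servicestatus"):
            continue
        if f.get("active_checks_enabled", "1") != "1":
            continue
        next_check = int(f.get("next_check", "0") or 0)
        if next_check and now - next_check > grace:
            late.append((f.get("host_name", "?"),
                         f.get("service_description", "?"),
                         now - next_check))
    return late
```

Run from cron against the file named by status_file in nagios.cfg, a non-empty result would have flagged Terry's service roughly twenty minutes after its missed 10:09:01 slot.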

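For the process-level failure modes mentioned in this thread (an extra Nagios daemon running, or the <defunct> child in Matthias's ps -ef output), a similar sketch can read ps output instead. It assumes Linux procps `ps -eo pid,ppid,stat,args` formatting and a daemonized Nagios parent that has been reparented to init (PPID 1), as in Matthias's listing; all function names here are hypothetical.

```python
import os
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    stat: str
    args: str


def read_ps() -> str:
    """Capture process info in the format parse_ps() expects (Linux procps)."""
    return subprocess.run(["ps", "-eo", "pid,ppid,stat,args"],
                          capture_output=True, text=True, check=True).stdout


def parse_ps(output: str) -> List[Proc]:
    """Parse `ps -eo pid,ppid,stat,args` output, skipping the header line."""
    procs = []
    for line in output.splitlines()[1:]:
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, args = parts
        procs.append(Proc(int(pid), int(ppid), stat, args))
    return procs


def is_nagios_daemon(p: Proc) -> bool:
    """A daemonized `nagios -d ...` process, reparented to init (PPID 1)."""
    tokens = p.args.split()
    return (p.ppid == 1 and bool(tokens)
            and os.path.basename(tokens[0]) == "nagios" and "-d" in tokens)


def nagios_problems(procs: List[Proc]) -> List[str]:
    """Flag a missing daemon, duplicate daemons, and zombie check children."""
    daemons = [p for p in procs if is_nagios_daemon(p)]
    if not daemons:
        return ["no Nagios daemon found"]
    problems = []
    if len(daemons) > 1:
        pids = ", ".join(str(p.pid) for p in daemons)
        problems.append(f"{len(daemons)} Nagios daemons running (pids {pids})")
    daemon_pids = {p.pid for p in daemons}
    for p in procs:
        # "Z" in the STAT column marks a zombie (<defunct>) process.
        if p.stat.startswith("Z") and p.ppid in daemon_pids:
            problems.append(f"defunct child {p.pid} of Nagios daemon {p.ppid}")
    return problems
```

Usage would be `nagios_problems(parse_ps(read_ps()))`. Nagios forks short-lived children for every check, so a zombie seen in one snapshot may be transient; one that survives several runs, like the Mar14 entry above, is the more meaningful signal.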

