2.0 stable stops checking

Eli Stair estair at ilm.com
Fri Mar 17 22:26:05 CET 2006


FYI, here is my current config.  I'm also running Ganglia (gmetad), Cacti
(cactid), and a number of other checks/scripts/processes concurrently
(as well as the Apache frontend for them).  I'm not seeing _any_ issues
with other processes on this system.

/eli

### Current errors:
In addition to unannounced failures of services:
status.cgi[32625]: segfault at 0000002a95c8b000 rip 00000036467716e0 rsp 0000007fbffff148 error 4
extinfo.cgi[28957]: segfault at 0000002a95d45000 rip 00000036467716e0 rsp 0000007fbfffecd8 error 4
extinfo.cgi[5263]: segfault at 0000002a956c9000 rip 00000036467716e0 rsp 0000007fbfffed38 error 4
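
A minimal sketch for reading the trailing "error N" value in kernel segfault lines like these, assuming the standard x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name decode_segv_error is hypothetical and only for illustration.

```shell
#!/bin/sh
# Decode the "error N" field of an x86 kernel segfault log line.
# Assumption: N is the hardware page-fault error code, where
#   bit 0 set = protection violation (clear = page not present)
#   bit 1 set = write access        (clear = read access)
#   bit 2 set = fault in user mode  (clear = kernel mode)
decode_segv_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then
        cause="protection violation"
    else
        cause="page not present"
    fi
    if [ $((code & 2)) -ne 0 ]; then
        access="write"
    else
        access="read"
    fi
    if [ $((code & 4)) -ne 0 ]; then
        mode="user"
    else
        mode="kernel"
    fi
    echo "$mode-mode $access, $cause"
}

# The three CGI crashes above all report "error 4".
decode_segv_error 4
```

Under that assumption, error 4 reads as a user-mode read of an unmapped page. The identical rip value in all three lines suggests the same faulting instruction each time, though that alone does not identify the cause.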


### System specs:

RHEL4.2 x86_64
8GB RAM
2x Opteron 280

GCC 3.4.4
Apache2.0.52

(quite unloaded:
free -m:
-/+ buffers/cache:       1544       6438

cat /proc/loadavg
1.19 1.22 1.17 1/309 8806
)

### Config command (compiled native 64-bit):

   $ ./configure --prefix=/usr/local/nagios --with-nagios-user=root --with-nagios-group=root

## --------- ##
## Platform. ##
## --------- ##

hostname = monitor02.lucasfilm.com
uname -m = x86_64
uname -r = 2.6.9-22.0.1.ELsmp
uname -s = Linux
uname -v = #1 SMP Thu Oct 27 14:49:37 CDT 2005


### Nagios output:

  ../bin/nagios -v nagios.cfg

Nagios 2.0
Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org)
Last Modified: 02-07-2006
License: GPL

Reading configuration data...

Running pre-flight check on configuration data...

Warning: Size of service_message struct (528 bytes) is > 
POSIX-guaranteed atomic write size (512 bytes).  Service checks results 
may get lost or mangled!
Checking services...
         Checked 7397 services.
Checking hosts...
         Checked 754 hosts.
Checking host groups...
         Checked 14 host groups.
Checking service groups...
         Checked 5 service groups.
Checking contacts...
         Checked 6 contacts.
Checking contact groups...
Warning: Contact group 'swat' is not used in any host/service 
definitions or host/service escalations!
         Checked 4 contact groups.
Checking service escalations...
         Checked 0 service escalations.
Checking service dependencies...
         Checked 6454 service dependencies.
Checking host escalations...
         Checked 0 host escalations.
Checking host dependencies...
         Checked 0 host dependencies.
Checking commands...
         Checked 47 commands.
Checking time periods...
         Checked 4 time periods.
Checking extended host info definitions...
         Checked 461 extended host info definitions.
Checking extended service info definitions...
         Checked 0 extended service info definitions.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 2
Total Errors:   0

Things look okay - No serious problems were detected during the 
pre-flight check

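
On the first warning in that output: a small sketch for checking the atomic pipe write limit on a given system, assuming the warning refers to PIPE_BUF (POSIX guarantees only 512 bytes; Linux normally reports 4096). The function names and the default path "/" are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Print the atomic pipe write limit (PIPE_BUF) for the filesystem
# holding the given path. The default path "/" is an assumption.
pipe_buf_limit() {
    getconf PIPE_BUF "${1:-/}"
}

# Succeed if a message of the given size (bytes) fits within PIPE_BUF.
fits_atomic_write() {
    size=$1
    limit=$(pipe_buf_limit "${2:-/}")
    [ "$size" -le "$limit" ]
}

# 528 bytes is the service_message struct size from the warning above.
if fits_atomic_write 528; then
    echo "528-byte message fits within PIPE_BUF ($(pipe_buf_limit))"
else
    echo "528-byte message exceeds PIPE_BUF ($(pipe_buf_limit))"
fi
```

If the reported limit is larger than 528, the warning would be about the portable POSIX minimum rather than this particular platform. Whether that rules it out as a factor in the missing check results is not established here.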

Terry wrote:
> Ya, I have tried that.   The nagios.log is not being updated with
> everything.  I have a ton of INITIAL STATE messages in there from the
> start but after that there are only 2 service checks in there but my
> status page says I have 29 OK services with 424 pending.  Huge
> inconsistencies.
> 
> After all is said and done, all I see in the process list ever are
> check_ping's for ONLY hosts in 1 hostgroup.  Weird.
> 
> 
> On 3/17/06, Eli Stair <estair at ilm.com> wrote:
> 
>>Are you in a position to stop services for a minute and check starting
>>up again with the retention.dat file moved out of the way?  If you're
>>hesitant you may want to start up another instance of Nagios in parallel
>>for testing it and such.  That's sane, but I've proven to myself enough
>>that this is always the case (in _my_ _current_ instance) and just have
>>to do it on the production system when I catch it.
>>
>>I'm real curious to find out if this is the same exact issue/resolution
>>that works for you as well.
>>
>>/eli
>>
>>Terry wrote:
>>
>>>No, not all checks.  I see check_ping processes still firing up:
>>>
>>>[root@plaut08 etc]# ps xauwwww -H| grep nagios  | grep -v grep
>>>nagios   26676 11.0  0.1 28620 3852 ?        Ssl  13:35   0:11 /usr/bin/nagios -d /etc/nagios/nagios.cfg
>>>nagios   26814  0.0  0.1 28624 3852 ?        S    13:36   0:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
>>>nagios   26815  0.0  0.0  4684  640 ?        S    13:36   0:00 /usr/lib/nagios/plugins/check_ping -H 172.28.7.59 -w 3000.0,80%% -c 5000.0,100%% -p 15 -t 30
>>>nagios   26816  0.0  0.0  2580  528 ?        S    13:36   0:00 /bin/ping -n -U -w 90 -c 15 172.28.7.59
>>>
>>>
>>>I am seeing the same thing as you, where only certain hosts/hostgroups
>>>are being checked and then all of a sudden everything stops except the
>>>pings shown above, and those checks are not being updated in nagios.log.
>>>Very weird.
>>>
>>>On 3/17/06, Eli Stair <estair at ilm.com> wrote:
>>>
>>>
>>>>So you're seeing the scenario where nagios stops _all_ checks
>>>>altogether?  I've had this happen when the nagios parent process dies,
>>>>and logs to nagios.log to this effect "[1139362901] Caught SIGSEGV,
>>>>shutting down... ".  I was getting these very frequently when I went
>>>>above some apparent host/service threshold (went away when I removed
>>>>about 128 nodes at one point recently).  In these cases the CGI's still
>>>>respond for some reason, which seemed inappropriate...
>>>>
>>>>I've also seen the same symptom, but without a well-advertised nagios
>>>>failure, where the process is still present in memory but checks aren't
>>>>executed and the CGI's are functional.
>>>>
>>>>The third related (and my current bane...) issue is where MOST all
>>>>checks occur, but some (sometimes large) groups of unrelated actions no
>>>>longer occur.  Host/service checks as a whole seem to be working, but
>>>>I'll notice that I haven't gotten an alert for something that failed,
>>>>and then see that whole class of service checks on one hostgroup aren't
>>>>running anymore... and then start to see the same issue with other
>>>>checks/actions as well.
>>>>
>>>>I'd sure love to just have nagios start working again, as I'm strongly
>>>>against having to write an external framework for checking various parts
>>>>of Nagios and alerting me when it's broken!  Alternately, I've always kept
>>>>up to date on other OS monitor/alert frameworks and still nothing is as
>>>>extensible as Nagios is (yet).
>>>>
>>>>/eli
>>>>
>>>>
>>>>Terry wrote:
>>>>
>>>>
>>>>>In just looking at the logs, the status.log is being continuously
>>>>>updated as normal but when checks stop, the nagios.log stops gathering
>>>>>entries as well.
>>>>>
>>>>>On 3/17/06, Eli Stair <estair at ilm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>I've been seeing this continuously in 2.0beta/rc/releases.  For details
>>>>>>on my situation/posts check the devel/users archives, I'm curious if any
>>>>>>similarities exist.  I haven't gotten acknowledgement/resolution on this
>>>>>>either, the only thing I've determined is that (in my case) stopping
>>>>>>nagios and restarting with the retention file zeroed resolves the issue
>>>>>>100%.
>>>>>>
>>>>>>Having an extra nagios process running can definitely cause this and
>>>>>>other issues.  In my case that's never been present and thus not the
>>>>>>cause...
>>>>>>
>>>>>>/eli
>>>>>>
>>>>>>Terry wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>I am seeing this as well.  I have services that do not get checked
>>>>>>>when they are scheduled:
>>>>>>>
>>>>>>>Last Check Type:      ACTIVE
>>>>>>>Last Check Time:      03-17-2006 08:50:47
>>>>>>>Status Data Age:      0d 1h 37m 51s
>>>>>>>Next Scheduled Active Check:          03-17-2006 10:09:01
>>>>>>>Latency:      342.408 seconds
>>>>>>>Check Duration:       10.015 seconds
>>>>>>>Last State Change:    03-16-2006 11:55:02
>>>>>>>Current State Duration:       0d 22h 33m 36s
>>>>>>>
>>>>>>>It is currently 10:29 and it still hasn't been checked.  This is one of
>>>>>>>many examples.
>>>>>>>
>>>>>>>On 3/15/06, Matthias Eble
>>>>>>><matthias.eble at mailing.kaufland-informationssysteme.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>hi all!
>>>>>>>>
>>>>>>>>we are experiencing occasional problems with nagios 2.0 stable. The
>>>>>>>>main process was reloaded due to configuration changes yesterday (Mar
>>>>>>>>14th). since then ps -ef looks like this:
>>>>>>>>
>>>>>>>>nagios    1078     1 12 Mar09 ?        16:49:43 /opt/nagios/bin/nagios -d /opt/nagios/etc/nagios.cfg
>>>>>>>>nagios    9431  1078  0 Mar14 ?        00:00:00 [nagios] <defunct>
>>>>>>>>
>>>>>>>>and nagios stopped checking. Does anyone have an idea what could have
>>>>>>>>happened? The nagios.log and status.dat files have not been updated since then.
>>>>>>>>
>>>>>>>>thanks
>>>>>>>>matthias
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>-------------------------------------------------------
>>>>>>>>This SF.Net email is sponsored by xPML, a groundbreaking scripting language
>>>>>>>>that extends applications into web and mobile media. Attend the live webcast
>>>>>>>>and join the prime developer group breaking into this new coding territory!
>>>>>>>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
>>>>>>>>_______________________________________________
>>>>>>>>Nagios-users mailing list
>>>>>>>>Nagios-users at lists.sourceforge.net
>>>>>>>>https://lists.sourceforge.net/lists/listinfo/nagios-users
>>>>>>>>::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
>>>>>>>>::: Messages without supporting info will risk being sent to /dev/null
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
> 
> 


