Nagios freezes for long periods

Detrak detrak at caere.fr
Wed Oct 24 10:34:01 CEST 2007


Hi,


We run Nagios 2.9 on several servers, with NDO 1.4b5 and perf2rrd (Nagios
writes performance data to a pipe file and perf2rrd stores it in RRD
files). The servers run RHEL4 with packages from dag.wieers.com.

We have 80 hosts and 420 services on this server.



We see some huge gaps in our graphs. perf2rrd itself works fine. My first
investigation turned up these messages in the nagios.log file:
[1193178252] ndomod: Error writing to data sink!  Some output may get lost...
[1193178268] ndomod: Successfully reconnected to data sink!  0 items lost, 240 queued items to flush.
[1193178269] ndomod: Successfully flushed 240 queued items to data sink.
[1193187298] Warning: A system time change of 8729 seconds (forwards in time) has been detected.  Compensating...
[1193190553] Warning: A system time change of 3255 seconds (forwards in time) has been detected.  Compensating...
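
As a rough aid for reading these entries, the sketch below converts the epoch timestamps of the time-change warnings to wall-clock time and extracts the reported jump sizes. The function name parse_time_changes and the fixed UTC+2 offset (CEST, taken from the message date) are assumptions for illustration and are not part of Nagios. It also assumes each log entry sits on a single line, as it does in the log file itself.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is on CEST (UTC+2), as the message date suggests.
CEST = timezone(timedelta(hours=2))

# Matches "[epoch] Warning: A system time change of N seconds (forwards in time) ..."
PATTERN = re.compile(
    r"^\[(\d+)\] Warning: A system time change of (\d+) seconds \((\w+) in time\)"
)

def parse_time_changes(lines):
    """Return (detected_at, jump_seconds, direction) for each time-change warning."""
    results = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            detected_at = datetime.fromtimestamp(int(m.group(1)), tz=CEST)
            results.append((detected_at, int(m.group(2)), m.group(3)))
    return results

sample = [
    "[1193178252] ndomod: Error writing to data sink!  Some output may get lost...",
    "[1193187298] Warning: A system time change of 8729 seconds (forwards in time) has been detected.  Compensating...",
    "[1193190553] Warning: A system time change of 3255 seconds (forwards in time) has been detected.  Compensating...",
]

for detected_at, jump, direction in parse_time_changes(sample):
    # timedelta renders the jump size as H:MM:SS
    print(f"{detected_at:%H:%M:%S}  jump of {timedelta(seconds=jump)} {direction}")
```

With the sample lines, the two warnings are stamped 02:54:58 and 03:49:13, with jumps of 2:25:29 and 0:54:15.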



We recompiled Nagios with debug output enabled:
--enable-DEBUG2 shows warning messages
--enable-DEBUG3 shows scheduled events

We don't use DEBUG0 because it generates too much information and the
log file grows too fast.


I found the following in the debug output around the most recent gap:


*** Event Check Loop ***
        Current time: Wed Oct 24 00:29:29 2007
        Next High Priority Event Time: Wed Oct 24 00:29:30 2007
        Next Low Priority Event Time:  Wed Oct 24 00:29:29 2007
Current/Max Outstanding Service Checks: 19/65
*** Event Details ***
        Event time: Wed Oct 24 00:29:29 2007
        Event type: 0 (service check)
                Service Description: LOAD_AVERAGE at LOADAVERAGE
                Associated Host:     SGBD1
        Checking service 'LOAD_AVERAGE at LOADAVERAGE' on host 'SGBD1'...

*** Event Check Loop ***
        Current time: Wed Oct 24 00:29:29 2007
        Next High Priority Event Time: Wed Oct 24 00:29:30 2007
        Next Low Priority Event Time:  Wed Oct 24 00:29:29 2007
Current/Max Outstanding Service Checks: 20/65
*** Event Details ***
        Event time: Wed Oct 24 00:29:29 2007
        Event type: 0 (service check)
                Service Description: LOAD_AVERAGE at LOADAVERAGE
                Associated Host:     INTEG
        Checking service 'LOAD_AVERAGE at LOADAVERAGE' on host 'INTEG'...
Warning: A system time change of 8729 seconds (forwards in time) has been
detected.  Compensating...

*** Event Check Loop ***
        Current time: Wed Oct 24 02:54:58 2007
        Next High Priority Event Time: Wed Oct 24 02:54:59 2007
        Next Low Priority Event Time:  Wed Oct 24 02:54:58 2007
Current/Max Outstanding Service Checks: 21/65
*** Event Details ***
        Event time: Wed Oct 24 02:54:58 2007
        Event type: 0 (service check)
                Service Description: MONITOR_TELNET_SUIVI_PS
                Associated Host:     PREPROD1
        Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'PREPROD1'...
Warning: A system time change of 3255 seconds (forwards in time) has been
detected.  Compensating...

*** Event Check Loop ***
        Current time: Wed Oct 24 03:49:13 2007
        Next High Priority Event Time: Wed Oct 24 03:49:14 2007
        Next Low Priority Event Time:  Wed Oct 24 03:49:13 2007
Current/Max Outstanding Service Checks: 22/65
*** Event Details ***
        Event time: Wed Oct 24 03:49:13 2007
        Event type: 0 (service check)
                Service Description: MONITOR_TELNET_SUIVI_PS
                Associated Host:     BIDS15
        Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'BIDS15'...


We can see the jumps
    00:29:29 to 02:54:58
and 02:54:58 to 03:49:13

with no Nagios activity in between. I don't understand this!
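
To check that these jumps line up with the time changes Nagios reported, here is a minimal sketch that scans the "Current time:" lines of the debug output and lists any gap above a threshold. The function name find_gaps and the 5-minute threshold are hypothetical choices, and the date parsing assumes an English (C) locale for the day and month names.

```python
from datetime import datetime, timedelta

def find_gaps(debug_lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) for consecutive 'Current time:' values
    that are further apart than the threshold."""
    gaps = []
    previous = None
    for line in debug_lines:
        line = line.strip()
        if not line.startswith("Current time:"):
            continue
        stamp = line[len("Current time:"):].strip()
        # Format used in the debug output, e.g. "Wed Oct 24 00:29:29 2007"
        current = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps

debug_sample = [
    "        Current time: Wed Oct 24 00:29:29 2007",
    "        Current time: Wed Oct 24 00:29:29 2007",
    "        Current time: Wed Oct 24 02:54:58 2007",
    "        Current time: Wed Oct 24 03:49:13 2007",
]

for previous, current, gap in find_gaps(debug_sample):
    print(f"{previous:%H:%M:%S} -> {current:%H:%M:%S}: {int(gap.total_seconds())} seconds")
```

On the four timestamps above, the two gaps come out as 8729 and 3255 seconds, the same values as in the "system time change" warnings. So each warning covers exactly the period during which the event loop logged nothing.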


Could you help me make this Nagios server more stable? I don't know how to
reproduce this bug. At the time a gap was occurring, the server time was
correct.

We get more than one gap per day on this server!



Best regards,
Olivier