Nagios freezes for a long time

Vanhee Frederik fvanhee at gmail.com
Sat Oct 27 19:30:34 CEST 2007


Detrak wrote:
> Hi,
>
>
> We run Nagios 2.9 on several servers, with NDO 1.4b5 and perf2rrd
> (Nagios writes performance data to a pipe file and perf2rrd stores it
> in RRD files). The servers run RHEL4 with packages from
> dag.wieers.com <http://dag.wieers.com>
>
> We have 80 hosts and 420 services on this server.
>
>
>
> We see some huge gaps in our graphs. perf2rrd works fine; my first
> investigation turned up these messages in the nagios.log file:
> [1193178252] ndomod: Error writing to data sink!  Some output may get 
> lost...
> [1193178268] ndomod: Successfully reconnected to data sink!  0 items 
> lost, 240 queued items to flush.
> [1193178269] ndomod: Successfully flushed 240 queued items to data sink.
> [1193187298] Warning: A system time change of 8729 seconds (forwards 
> in time) has been detected.  Compensating...
> [1193190553] Warning: A system time change of 3255 seconds (forwards 
> in time) has been detected.  Compensating...
>
>
>
> We have recompiled Nagios with debug options:
> --enable-DEBUG2 shows warning messages
> --enable-DEBUG3 shows scheduled events
>
> We don't use DEBUG0 because it generates too much information and
> the log file grows too fast.
>
>
> So, I found these messages in the debug output around the last gap:
>
>
> *** Event Check Loop ***
>         Current time: Wed Oct 24 00:29:29 2007
>         Next High Priority Event Time: Wed Oct 24 00:29:30 2007
>         Next Low Priority Event Time:  Wed Oct 24 00:29:29 2007
> Current/Max Outstanding Service Checks: 19/65
> *** Event Details ***
>         Event time: Wed Oct 24 00:29:29 2007
>         Event type: 0 (service check)
>                 Service Description: LOAD_AVERAGE at LOADAVERAGE
>                 Associated Host:     SGBD1
>         Checking service 'LOAD_AVERAGE at LOADAVERAGE' on host 'SGBD1'...
>
> *** Event Check Loop ***
>         Current time: Wed Oct 24 00:29:29 2007
>         Next High Priority Event Time: Wed Oct 24 00:29:30 2007
>         Next Low Priority Event Time:  Wed Oct 24 00:29:29 2007
> Current/Max Outstanding Service Checks: 20/65
> *** Event Details ***
>         Event time: Wed Oct 24 00:29:29 2007
>         Event type: 0 (service check)
>                 Service Description: LOAD_AVERAGE at LOADAVERAGE
>                 Associated Host:     INTEG
>         Checking service 'LOAD_AVERAGE at LOADAVERAGE' on host 'INTEG'...
> Warning: A system time change of 8729 seconds (forwards in time) has 
> been detected.  Compensating...
>
> *** Event Check Loop ***
>         Current time: Wed Oct 24 02:54:58 2007
>         Next High Priority Event Time: Wed Oct 24 02:54:59 2007
>         Next Low Priority Event Time:  Wed Oct 24 02:54:58 2007
> Current/Max Outstanding Service Checks: 21/65
> *** Event Details ***
>         Event time: Wed Oct 24 02:54:58 2007
>         Event type: 0 (service check)
>                 Service Description: MONITOR_TELNET_SUIVI_PS
>                 Associated Host:     PREPROD1
>         Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'PREPROD1'...
> Warning: A system time change of 3255 seconds (forwards in time) has 
> been detected.  Compensating...
>
> *** Event Check Loop ***
>         Current time: Wed Oct 24 03:49:13 2007
>         Next High Priority Event Time: Wed Oct 24 03:49:14 2007
>         Next Low Priority Event Time:  Wed Oct 24 03:49:13 2007
> Current/Max Outstanding Service Checks: 22/65
> *** Event Details ***
>         Event time: Wed Oct 24 03:49:13 2007
>         Event type: 0 (service check)
>                 Service Description: MONITOR_TELNET_SUIVI_PS
>                 Associated Host:     BIDS15
>         Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'BIDS15'...
>
>
> We can see the jumps:
>     00:29:29 to 02:54:58
> and 02:54:58 to 03:49:13
>
> with no activity in Nagios in between! I don't understand this!
>
>
> Could you give me some help making this Nagios server more stable?
> I don't know how to reproduce this bug. At the time a gap occurred,
> the server time was correct.
>
> On this server we get more than one gap per day!
>
>
>
> best regards,
> Olivier
> ------------------------------------------------------------------------
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> ------------------------------------------------------------------------
>
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
You should at least upgrade to ndoutils 1.4b6. That already solves one
problem: the very frequent disconnects from and reconnects to the data sink.
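
As a quick cross-check that the two "system time change" warnings in
nagios.log describe the same gaps as the debug output, here is a minimal
Python sketch. It assumes the server clock was on CEST (UTC+2), and the
helper name parse_time_changes is made up for illustration; the two log
lines are copied from the quoted excerpt, re-joined where the mail wrapped
them.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server ran on CEST (UTC+2) on 24 Oct 2007.
CEST = timezone(timedelta(hours=2))

# Copied from the quoted nagios.log excerpt (mail line wraps re-joined).
LOG_LINES = [
    "[1193187298] Warning: A system time change of 8729 seconds "
    "(forwards in time) has been detected.  Compensating...",
    "[1193190553] Warning: A system time change of 3255 seconds "
    "(forwards in time) has been detected.  Compensating...",
]

# "[epoch] Warning: A system time change of N seconds (... in time)"
PATTERN = re.compile(
    r"^\[(\d+)\] Warning: A system time change of (\d+) seconds \(\w+ in time\)"
)


def parse_time_changes(lines):
    """Return (detected_at, jump) pairs for each time-change warning."""
    changes = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            detected_at = datetime.fromtimestamp(int(match.group(1)), tz=CEST)
            jump = timedelta(seconds=int(match.group(2)))
            changes.append((detected_at, jump))
    return changes


for detected_at, jump in parse_time_changes(LOG_LINES):
    # The warning is logged after the jump, so subtract it to get the start.
    started_at = detected_at - jump
    print(f"{started_at:%H:%M:%S} -> {detected_at:%H:%M:%S} "
          f"({int(jump.total_seconds())} s)")
```

Under that time zone assumption, the start and end of each interval line up
with the 00:29:29, 02:54:58 and 03:49:13 timestamps in the debug output, so
the reported "time change" covers exactly the period with no logged activity.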

Frederik

