Unpredictable service check times fixed?

Mateo Carr mcarr at apple.com
Tue Apr 15 04:30:06 CEST 2003


I have been experiencing this problem as well. We have 46 hosts running  
330 services currently configured in nagios. I expect this to grow to  
about 300 hosts and over 1800 service checks assuming we can get this  
issue resolved.

I recompiled with --enable-DEBUG3 as suggested. Below is the relevant
output displayed on the screen (from running nagios without the -d
option) and the associated nagios.log output.

nagios.log sample:
[1050363305] Warning: The check of service 'Root Volume Usage' on host  
'webx02' could not be performed due to a fork() error.  The check will  
be rescheduled.
[1050363305] Warning: The check of service 'NFS' on host 'webx01' could  
not be performed due to a fork() error.  The check will be rescheduled.
[1050363305] Warning: The check of service 'Syslogd' on host 'webx03'  
could not be performed due to a fork() error.  The check will be  
rescheduled.
[1050363305] Warning: The check of service 'CLOSE_WAITS' on host  
'webx05' could not be performed due to a fork() error.  The check will  
be rescheduled.
[1050363305] Warning: The check of service 'Cron' on host '<snip> 01'  
could not be performed due to a fork() error.  The check will be  
rescheduled.
[1050363305] Warning: The check of service 'NFS' on host 'webx09' could  
not be performed due to a fork() error.  The check will be rescheduled.
[1050363305] Warning: The check of service 'Clock_Drift' on host  
'webx06' could not be performed due to a fork() error.  The check will  
be rescheduled.
[1050363305] Warning: The check of service 'Cron' on host '<snip>03'  
could not be performed due to a fork() error.  The check will be  
rescheduled.
etc.....

output on the screen:
*** Event Check Loop ***
         Current time: Mon Apr 14 16:35:05 2003
         Next High Priority Event Time: Mon Apr 14 16:35:12 2003
         Next Low Priority Event Time:  Mon Apr 14 16:34:37 2003
Current/Max Outstanding Checks: 106/0
*** Event Details ***
         Event type: 0 (service check)
                 Service Description: Root Volume Usage
                 Associated Host:     webx02
         Event time: Mon Apr 14 16:34:37 2003
         Checking service 'Root Volume Usage' on host 'webx02'...
         Input: check_nrpe!check_root_disk
         Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'Root Volume Usage' on host 'webx02'  
could not be performed due to a fork() error.  The check will be  
rescheduled.
         Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
         Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003

*** Event Check Loop ***
         Current time: Mon Apr 14 16:35:05 2003
         Next High Priority Event Time: Mon Apr 14 16:35:12 2003
         Next Low Priority Event Time:  Mon Apr 14 16:34:37 2003
Current/Max Outstanding Checks: 107/0
*** Event Details ***
         Event type: 0 (service check)
                 Service Description: NFS
                 Associated Host:     webx01
         Event time: Mon Apr 14 16:34:37 2003
         Checking service 'NFS' on host 'webx01'...
         Input: check_nrpe!check_nfs_hang
         Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'NFS' on host 'webx01' could not be  
performed due to a fork() error.  The check will be rescheduled.
         Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
         Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003

*** Event Check Loop ***
         Current time: Mon Apr 14 16:35:05 2003
         Next High Priority Event Time: Mon Apr 14 16:35:12 2003
         Next Low Priority Event Time:  Mon Apr 14 16:34:38 2003
Current/Max Outstanding Checks: 108/0
*** Event Details ***
         Event type: 0 (service check)
                 Service Description: Syslogd
                 Associated Host:     webx03
         Event time: Mon Apr 14 16:34:38 2003
         Checking service 'Syslogd' on host 'webx03'...
         Input: check_nrpe!check_syslog
         Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'Syslogd' on host 'webx03' could not be  
performed due to a fork() error.  The check will be rescheduled.
         Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
         Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003

*** Event Check Loop ***
         Current time: Mon Apr 14 16:35:05 2003
         Next High Priority Event Time: Mon Apr 14 16:35:12 2003
         Next Low Priority Event Time:  Mon Apr 14 16:34:38 2003
Current/Max Outstanding Checks: 109/0
*** Event Details ***
         Event type: 0 (service check)
                 Service Description: CLOSE_WAITS
                 Associated Host:     webx05
         Event time: Mon Apr 14 16:34:38 2003
         Checking service 'CLOSE_WAITS' on host 'webx05'...
         Input: check_nrpe!check_close_wait
         Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'CLOSE_WAITS' on host 'webx05' could not  
be performed due to a fork() error.  The check will be rescheduled.
         Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
         Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003

etc.....you get the point.

A restart appears to clear up the problem for about 3 to 4 hours.
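
The log lines above report only that fork() failed, not why. As a minimal sketch (a hypothetical stand-alone helper, not code taken from Nagios), the errno value left by a failed fork() can be logged to tell a process-limit problem (EAGAIN) apart from a memory shortage (ENOMEM). The names fork_failure_hint and probe_fork are assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of Nagios: map a fork() errno to a hint. */
static const char *fork_failure_hint(int err)
{
    switch (err) {
    case EAGAIN:
        return "process limit reached or resources temporarily unavailable";
    case ENOMEM:
        return "not enough memory or swap for a new process";
    default:
        return "unexpected error";
    }
}

/* Try a single fork(). On failure, log errno and a hint to `out`.
 * Returns 0 if a child could be created, otherwise the errno value. */
static int probe_fork(FILE *out)
{
    pid_t pid;
    int err;

    assert(out != NULL);

    pid = fork();
    if (pid == 0)
        _exit(0);               /* child: exit immediately */
    if (pid > 0) {
        waitpid(pid, NULL, 0);  /* parent: reap the child so it does not linger */
        return 0;
    }

    err = errno;                /* save errno before any other call can change it */
    fprintf(out, "fork() failed: %s (%s)\n",
            strerror(err), fork_failure_hint(err));
    return err;
}
```

Calling probe_fork(stderr) as the same user that runs nagios, while the warnings are appearing, would show which of the two cases applies. If it reports EAGAIN, comparing the number of processes owned by that user with its process limit would be a reasonable next step.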

Any light that could be shed on why this is happening would be very  
much appreciated.

Thanks!

Mateo Carr
Systems Engineer
Apple Computer, Inc.
mcarr at apple.com


On Saturday, April 12, 2003, at 10:20  PM, Stanley Hopcroft wrote:

> Dear Sir,
>
> I am writing to thank you for your letter and say,
>
> 0 If you are not using Nagios-1.0 then please try that, otherwise
>
> 1 You may have found a bug as you say in the scheduler.
>
> However, there are _many_ Nag installations monitoring far more hosts
> and services without problems. (Here ~ 200 hosts and 350 services).
>
> If that is the case, the only way you can demonstrate the bug is by
> setting up a Test Nag environment - it could be your production
> environment since that is exhibiting the problem - and run Nagios in
> such a way that you can collect debug information.
>
> This is probably easiest done by rebuilding Nag with the appropriate
> debug config option
>
> (./configure --help
>   ..
> --enable-DEBUG0 shows function entry and exit
> --enable-DEBUG1 shows general info messages
> --enable-DEBUG2 shows warning messages
> --enable-DEBUG3 shows scheduled events (service and host checks... etc)
> --enable-DEBUG4 shows service and host notifications
> --enable-DEBUG5 shows SQL queries
>
> so probably DEBUG3)
>
> then run Nag in foreground (no -d) and post the parts of the log that
> show scheduling anomalies.
>
> Alternatively, modify the plugin of the service that seems to be
> suffering the most severe scheduling delays to log its invocation and
> exit.
>
> This probably means adding code like (to a C plugin)
>
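> +#include <stdio.h>
> +#include <time.h>
> +time_t   my_clock;
> +my_clock = time(NULL) ;   /* time() takes an argument; NULL is fine */
> +/* ctime() output already ends in a newline */
> +fprintf(stderr, "myPlugin started at %s", ctime(&my_clock)) ;
>
>  ...
>
> +my_clock = time(NULL) ;
> +fprintf(stderr, "myPlugin finished at %s", ctime(&my_clock)) ;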
>
> recompiling it and installing it - probably under a new name - in the
> Nag libexec directory.
>
> 2 If you want a tactical dumb solution, run a cron job that sends a
> hangup signal to Nagios periodically (or restarts it).
>
> You probably want to post the relevant parts of nagios.cfg also.
>
> Yours sincerely.
>
> --  
> ------------------------------------------------------------------------
> Stanley Hopcroft
> ------------------------------------------------------------------------
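
The plugin-timing suggestion quoted above can be fleshed out a little. This is a minimal sketch only: the helper names format_plugin_event and log_plugin_event are hypothetical, and it assumes a POSIX system where localtime_r() is available. It writes one timestamped line to stderr per event, so the start and finish times of a plugin run can be compared with the times Nagios scheduled the check for:

```c
#define _POSIX_C_SOURCE 200809L  /* for localtime_r() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: write "<name> <event> at YYYY-MM-DD HH:MM:SS" into buf.
 * Returns the number of characters written, or -1 if it does not fit. */
static int format_plugin_event(char *buf, size_t len, const char *name,
                               const char *event, time_t when)
{
    char stamp[32];
    struct tm tm_buf;
    int n;

    assert(buf != NULL && name != NULL && event != NULL);

    if (localtime_r(&when, &tm_buf) == NULL)
        return -1;
    if (strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", &tm_buf) == 0)
        return -1;

    n = snprintf(buf, len, "%s %s at %s", name, event, stamp);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: log one event line to stderr using the current time. */
static void log_plugin_event(const char *name, const char *event)
{
    char line[128];

    if (format_plugin_event(line, sizeof line, name, event, time(NULL)) >= 0)
        fprintf(stderr, "%s\n", line);
}
```

A plugin would call log_plugin_event("myPlugin", "started") as the first statement in main() and log_plugin_event("myPlugin", "finished") just before it returns. Using strftime() instead of ctime() keeps the timestamp free of the trailing newline that ctime() adds.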



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




