SIGXFSZ causes nagios to exit silently with nagios 2.9

Ethan Galstad nagios at nagios.org
Tue Aug 21 04:02:43 CEST 2007


John Rouillard wrote:
> Hi all:
> 
> I am seeing the top level nagios daemon exiting shortly after startup
> (after its first few scheduled service checks are started). When it
> exits it doesn't log anything, nor does it clear out the status files to
> indicate to the web interface that it has exited.
> 
> When run under gdb I see:
> 
>   Program received signal SIGXFSZ, File size limit exceeded.
>   (gdb) where
>   #0  0x0060a7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>   #1  0x006dc11b in __write_nocancel () from /lib/tls/libc.so.6
>   #2  0x0068109f in _IO_new_file_write () from /lib/tls/libc.so.6
>   #3  0x0067fafb in _IO_new_do_write () from /lib/tls/libc.so.6
>   #4  0x006807a2 in _IO_new_file_sync () from /lib/tls/libc.so.6
>   #5  0x00675af2 in fflush () from /lib/tls/libc.so.6
>   #6  0x0808f8d9 in xpddefault_update_service_performance_data_file (
>       svc=0x9da19d0) at ../xdata/xpddefault.c:677
>   #7  0x0808f8fc in xpddefault_update_service_performance_data (svc=0x9da19d0)
>       at ../xdata/xpddefault.c:403
>   #8  0x0808e8a1 in update_service_performance_data (svc=0x9da19d0)
>       at perfdata.c:91
>   #9  0x08057b78 in reap_service_checks () at checks.c:1415
>   #10 0x08063790 in handle_timed_event (event=0x9a41ca0) at events.c:1255
>   #11 0x08063e51 in event_execution_loop () at events.c:966
>   #12 0x08053ad5 in main (argc=2, argv=0xbfeead04) at nagios.c:715
> 
> Now I am hitting the 2GB limit on the service perfdata file:
> 
>   [rouilj at ops01 ~]$ ls -lh /var/spool/nagios/tmp/service-perfdata 
>   -rw-rw-r--  1 nagios nagios 2.0G Jun  2 09:21 /var/spool/nagios/tmp/service-perfdata
> 
> (exact size 2147483647 bytes). The file size ulimit on the process is
> unlimited.
>   [rouilj at ops01 ~]$ ulimit -a
>   core file size          (blocks, -c) 0
>   data seg size           (kbytes, -d) unlimited
>   file size               (blocks, -f) unlimited
>   pending signals                 (-i) 1024
>   max locked memory       (kbytes, -l) 32
>   max memory size         (kbytes, -m) unlimited
>   open files                      (-n) 1024
>   pipe size            (512 bytes, -p) 8
>   POSIX message queues     (bytes, -q) 819200
>   stack size              (kbytes, -s) 10240
>   cpu time               (seconds, -t) unlimited
>   max user processes              (-u) 73728
>   virtual memory          (kbytes, -v) unlimited
>   file locks                      (-x) unlimited
> 
> It's a 32-bit i686 kernel. uname -a reports:
> 
>   Linux ops01.renesys.com 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 10:11:19
>   EST 2007 i686 i686 i386 GNU/Linux
> 
> I think nagios can handle this case better by:
> 
>   1) Trapping the SIGXFSZ signal so it doesn't exit
>   2) Logging an error to nagios.log
>   3) Scheduling a close and reopen of host_perfdata_file and
>      service_perfdata_file, allowing the user to rotate the file on command,
>      or re-enable perfdata logging by moving the files aside and
>      having nagios recreate the files.
> 
> 3 is kind of a hack, but there is no signal currently that closes and
> reopens the output files (host_perfdata_file, service_perfdata_file)
> without resetting all of the nagios daemon's internal state.  With 3
> implemented, it is possible to rotate these files without resetting
> nagios's internal state (current scheduled services queue for example)
> on user demand.
> 
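
Suggestions 1 through 3 above could be sketched roughly as follows. This is
only an illustration, not code from the Nagios source: the flag
perfdata_reopen_needed and the helpers install_sigxfsz_handler() and
reopen_perfdata_if_needed() are hypothetical names. The handler only sets a
flag, and the logging and reopen happen later from the normal event loop.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag: set by the handler, polled from the main loop. */
static volatile sig_atomic_t perfdata_reopen_needed = 0;

/* Async-signal-safe handler: it only records that SIGXFSZ arrived. */
static void handle_sigxfsz(int sig)
{
    (void)sig;
    perfdata_reopen_needed = 1;
}

/* Replace the default action (terminate) with the handler above.
 * Once installed, the failing write() returns -1 with errno EFBIG. */
static int install_sigxfsz_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigxfsz;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGXFSZ, &sa, NULL);
}

/* Hypothetical helper for the event loop: if the flag is set, log a
 * warning, close the stream and reopen the file in append mode.
 * Returns 1 if the file was reopened, 0 otherwise. */
static int reopen_perfdata_if_needed(FILE **fp, const char *path, FILE *log)
{
    if (!perfdata_reopen_needed)
        return 0;
    perfdata_reopen_needed = 0;

    fprintf(log, "Warning: file size limit hit on %s, reopening\n", path);
    if (*fp != NULL)
        fclose(*fp);
    *fp = fopen(path, "a");
    return *fp != NULL;
}
```

If the full file has not been moved aside before the reopen, the next write
fails again with EFBIG, so the reopen only helps once the file has been
rotated.
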
> Alternatively the log rotation mechanism currently available for the
> main log file (nagios.log) could be extended to automatically rotate
> and archive these files. I would be happy if all the files were
> rotated/archived on the same schedule as the main log file, but people
> will probably want the following options in nagios.cfg:
> 
>   host_perfdata_rotation_method, service_perfdata_rotation_method:
>      no rotation, hourly, daily, weekly, monthly.
> 
>   host_perfdata_archive_path, service_perfdata_archive_path:
>     move host_perfdata_file, service_perfdata_file to the archive
>     directory with a timestamped extension similar to nagios log file.
> 
> Now this does bring up an interesting question: does anybody have a
> status.dat or retention.dat (or less likely comments.dat or
> downtime.dat) file that is approaching 2GB? What will happen to nagios
> when this limit is hit?
> 
> As an alternative nagios could take the performance hit and use the
> 64-bit file-access and file-locking system calls instead of the
> regular calls for the files where this is liable to be an issue. Hmm,
> can you mix 32-bit and 64-bit file I/O in a single program?
> 
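
On the 64-bit file access idea: the exact size reported above, 2147483647
bytes, is 2^31 - 1, the largest offset a signed 32-bit off_t can hold. A
minimal sketch of one way around it, assuming glibc, is to build with
_FILE_OFFSET_BITS=64 so that off_t and the ordinary calls become 64-bit.
The helper names below are hypothetical. Mixing should also be possible:
with _LARGEFILE64_SOURCE defined, glibc exposes fopen64()/open64() alongside
the regular calls, so only selected files would need the wider interface.

```c
/* Must be defined before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Returns 1 when off_t can represent offsets beyond 2^31 - 1. */
static int has_large_file_offsets(void)
{
    return sizeof(off_t) >= 8;
}

/* Hypothetical helper: open a perfdata file for appending. With the
 * macro above, fopen() uses the 64-bit interface on 32-bit glibc. */
static FILE *open_perfdata_for_append(const char *path)
{
    return fopen(path, "a");
}
```
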
> Since nagios exited on the signal, I just moved the service perfdata
> file aside and restarted nagios to get it operating again.
> 
> 				-- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.

Hmm, I've never heard of anyone else with this issue yet, but I guess 
they're rotating perfdata files more often than you are.  Or perhaps you 
are monitoring a *very* large system. :-)

You can use these two config file options to run a command at a 
specified interval to rotate the perfdata logs or do whatever you want.

host_perfdata_file_processing_interval=60
host_perfdata_file_processing_command=somecommand
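
The service-side counterparts, service_perfdata_file_processing_interval and
service_perfdata_file_processing_command, apply to the file in the report
above. As a rough idea of what "somecommand" might do, here is a sketch that
moves the perfdata file aside under a timestamped name. The function names
are hypothetical, and it assumes Nagios reopens (and so recreates) the file
after the command returns.

```c
#include <stdio.h>
#include <time.h>

/* Build "<path>.<epoch seconds>" in dst.
 * Returns 0 on success, -1 if dst is too small. */
static int build_rotated_name(char *dst, size_t dst_len,
                              const char *path, time_t now)
{
    int n = snprintf(dst, dst_len, "%s.%ld", path, (long)now);

    return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}

/* Move the perfdata file aside. rename() is atomic as long as the
 * source and destination are on the same filesystem.
 * Returns 0 on success, -1 on failure. */
static int rotate_perfdata_file(const char *path, time_t now)
{
    char rotated[4096];

    if (build_rotated_name(rotated, sizeof(rotated), path, now) != 0)
        return -1;
    return rename(path, rotated);
}
```

A small wrapper program could call rotate_perfdata_file() with the perfdata
path and time(NULL), and be named as the processing command.
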


Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
