One cause of the 'Internal Server Errors' with nagios 2.3

Bill Ryder bill.ryder.nz at gmail.com
Wed May 17 04:03:40 CEST 2006


Hi all,

(First of all - hope the moderator hasn't approved my previous post on
this topic before I resend this one!).


This may or may not apply to anyone having Internal Server Errors but
it certainly fixed my nagios.

Summary:
=======

Make sure your status.dat file (config variable status_file) is on the
same filesystem as your temporary file (config variable temp_file).

In fact I think nagios should enforce this.
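
A rough way to check this before starting nagios is to compare the
device IDs of the two directories. This is only a sketch, not anything
nagios itself does: the function name same_filesystem is made up for
illustration, and it assumes a POSIX system where os.stat() reports
st_dev.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if the directories holding both paths share a device ID.

    Hypothetical helper, not part of nagios.
    """
    # Compare the parent directories rather than the files themselves,
    # because the temp file is created fresh on every status update and
    # may not exist at the moment of the check.
    dir_a = os.path.dirname(os.path.abspath(path_a))
    dir_b = os.path.dirname(os.path.abspath(path_b))
    return os.stat(dir_a).st_dev == os.stat(dir_b).st_dev
```

Call it with the status_file and temp_file values from your own
nagios.cfg. Note that equal device IDs are not a complete guarantee:
on Linux, rename() also fails with EXDEV between two mount points of
the same filesystem.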

If the two files are on different filesystems and you have a lot of
services and/or hosts, you will get this problem intermittently.
Essentially, status.dat is rewritten in place while many of the CGIs
still have the old contents mmap'ed.

If you are only monitoring a few hosts you'll probably get lucky
because the copies take a very short period of time and hence the
window for the fault to occur is very small.

Perhaps nagios's my_rename function should, when the filesystems are
different, copy the file to a temporary name in the destination
directory and then rename that copy over the old file.
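
The idea can be sketched as follows. This is not the nagios my_rename
code (which is C); it is a minimal Python illustration of the
copy-then-rename approach, and the function name, the ".status-" prefix
and the 0644 mode (taken from the strace output further down) are
assumptions.

```python
import errno
import os
import shutil
import tempfile


def replace_across_filesystems(src, dst):
    """Move src over dst so readers never see a partly written dst.

    Hypothetical helper illustrating the suggestion above.
    """
    try:
        os.rename(src, dst)  # atomic when src and dst share a filesystem
        return
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Cross-device case: stage a complete copy next to dst, then rename it.
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, staged = tempfile.mkstemp(prefix=".status-", dir=dst_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
        os.chmod(staged, 0o644)  # mkstemp creates the file with mode 0600
        os.rename(staged, dst)   # same directory, so this rename is atomic
    except BaseException:
        os.unlink(staged)
        raise
    os.unlink(src)
```

Because the final step is a rename within one directory, a CGI that
already has the old status.dat mapped keeps reading the old file, and
any new reader gets the complete new one.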

Long version:
=========

At Weta Digital I have just started using Nagios to monitor our
renderwall (currently around 1,500 machines - 9,000 services). We've
been using nagios for our production servers for years.

I was getting the 'Internal Server Error' quite often.

I could easily reproduce the problem by running status.cgi from a
debug script which looks like:

#!/bin/sh
REQUEST_METHOD="GET"
QUERY_STRING='host=all&servicestatustypes=28'
export QUERY_STRING REQUEST_METHOD
gdb ./status.cgi



I only had to run it about 10-20 times to get a crash like this:

Program received signal SIGBUS, Bus error.
mmap_fgets (temp_mmapfile=0x8600c68) at cgiutils.c:1195
1195                    if(*(char *)(temp_mmapfile->mmap_buf+x)=='\n')

(gdb) p *temp_mmapfile
$3 = {path = 0x8074050 "/var/tmp/nagios_ramdisk/status.dat", mode = 1668573559, fd = 7, file_size = 11087473, current_position = 1570370, current_line = 69104, mmap_buf = 0xb737c000}
(gdb)

At this point the file had changed size. In other words, the file
changed under the mmap, which is a recipe for SIGBUS.

I then spent some time trying different mmap options and thinking
about clever solutions to this, and then decided I needed to figure out
exactly what the nagios core does with the status.dat file. (Which I
should have done to start with, of course :-).

This is what I found:

{106} # strace -e trace=file -p 28007 |& grep status.dat
rename("/var/cache/nagios2/nagios.tmp18dGu2", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10
rename("/var/cache/nagios2/nagios.tmplQrHE5", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10
rename("/var/cache/nagios2/nagios.tmpACHJEk", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10

At this point it was obvious.

I had put status.dat on a ramdisk for performance reasons but didn't
move the temp_file. So nagios was creating the new status.dat (called
nagios.tmpXXXXXX) on a different filesystem. The rename failed with
EXDEV, and nagios fell back to copying the file between filesystems,
truncating status.dat first (note the O_TRUNC in the open calls above).
This was causing the SIGBUS because the file shrank underneath the
CGIs' mmap of it.
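
For contrast, here is a small sketch of why a plain rename on one
filesystem is safe for readers: a mapping made before the rename keeps
pointing at the old file's contents. It deliberately does not reproduce
the truncation case, since touching the truncated pages would kill the
process with SIGBUS. The function name and file names are illustrative
only, and it assumes a POSIX system.

```python
import mmap
import os
import tempfile


def read_mapping_after_replace(old, new):
    """Map a file holding `old`, replace it by rename with `new`, and
    return (bytes seen through the existing mapping, bytes now at the path).

    `old` must be non-empty, because an empty file cannot be mapped.
    """
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "status.dat")
    staged = os.path.join(workdir, "nagios.tmp")
    with open(path, "wb") as f:
        f.write(old)
    # Map the current file, as a CGI would.
    with open(path, "rb") as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Write the new contents elsewhere in the same directory, then rename.
    with open(staged, "wb") as f:
        f.write(new)
    os.rename(staged, path)
    seen = view[:]  # still the old file: the mapping keeps it alive
    view.close()
    with open(path, "rb") as f:
        current = f.read()
    return seen, current
```

The first value returned is the old contents and the second is the new
contents, so an in-flight reader finishes with a consistent (if
slightly stale) file instead of crashing.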

I no longer get these faults now that both files are on the same filesystem.

Hope this helps someone!

Bill Ryder
System Engineer
Weta Digital

