Possible patch to cure CGI's not finding data for objects in status.dat

Ethan Galstad egalstad at nagios.org
Fri Jul 31 17:07:12 CEST 2009


Cary Petterborg wrote:
> Our status.dat file is about 37MB. We occasionally will find that
> valid services are not showing up from a status.cgi or extinfo.cgi
> page. People either get confused or, knowing about the problem,
> refresh the page to get the REAL data they need. Since
> the status.dat file is written to a temp file which is moved into
> place once the file is closed, it should not have partial contents.
> But, in our case at least, we were seeing results from the CGIs as
> if the file were only partially written. The problem with the current
> implementation is that it is possible that the file gets closed, but
> the contents are not completely flushed to disk when it is moved in to
> replace the old file. In testing this phenomenon I took a service
> from the end of the status.dat file and looked at a CGI page as
> quickly as I could for many iterations. I found that about every 30th
> time (my average) the page acted as if the service didn't exist.
> 
> That seems to be quite a high number of instances for the page to
> fail, so I added an fflush() before the fclose() and an fsync() right
> after the fclose(). This virtually guarantees that the file is
> completely written before the temp file is moved in to replace the
> outdated file. After making the change I was never able to get a
> failed page in more than 200 iterations of viewing the same page.
> 
> The other files that could be a problem (and for completeness' sake)
> are retention.dat, comments.dat and downtime.dat. So I applied the
> same change in principle to each of these.
> 
> I'm attaching a patch file that was done against our 2.7 version. I
> looked in the 3.0 code and it was not substantially different. The
> line numbers are different, though the context is the same, but the
> patch doesn't work on 3.0. I'm quite sure that a similar fix will
> work properly for 3.0.
> 
> If anyone else is having this problem, you might want to try this
> patch and see if it fixes your problems as well. It is probably a
> good candidate for a bug fix if it is found to be a valuable
> modification. I don't know if smaller installations of Nagios are
> having any issues like this or not, but I suspect it is possible
> since actually flushing to the disk is handled by the OS on its own
> timetable unless forced with fsync().
> 
> If you try this modification, please let me know of any issues you
> have.
> 
> Cary Petterborg

Good patch - I'll get this applied to Nagios 3.x HEAD.

- Ethan Galstad
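
For readers without the attachment, the write, flush, sync, rename sequence Cary describes can be sketched roughly as below. This is not the patch itself and not the actual Nagios code: the function name write_file_durably() and its parameters are hypothetical, made up for illustration. One deliberate difference from the description above: fsync() operates on an open file descriptor, so this sketch calls it after fflush() and before fclose().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not taken from Nagios): write `data` to `tmp_path`,
 * force it out to disk, then rename the temp file over `final_path`.
 * Returns 0 on success, -1 on any failure. */
int write_file_durably(const char *tmp_path, const char *final_path,
                       const char *data)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    if (fputs(data, fp) == EOF) {
        fclose(fp);
        unlink(tmp_path);
        return -1;
    }

    /* Push stdio's user-space buffer down to the kernel. */
    if (fflush(fp) != 0) {
        fclose(fp);
        unlink(tmp_path);
        return -1;
    }

    /* Ask the kernel to write its cached data for this file to disk.
     * The descriptor must still be open, hence before fclose(). */
    if (fsync(fileno(fp)) != 0) {
        fclose(fp);
        unlink(tmp_path);
        return -1;
    }

    if (fclose(fp) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Only now move the temp file into place over the old one. */
    if (rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    return 0;
}
```

A caller would use it along the lines of write_file_durably("status.tmp", "status.dat", buffer), where both file names are placeholders. On a file as large as the 37MB one mentioned above, the fsync() call may block for a noticeable time, so it is worth measuring before applying the same pattern to every data file.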
