Possible patch to cure CGI's not finding data for objects in status.dat

Cary Petterborg PetterborgCa at ldschurch.org
Fri Aug 7 01:15:42 CEST 2009


In response to your request for details of our system: We are running SuSE 9 writing to ReiserFS (with a separate web server reading the status.dat, etc. from an NFS mount off the main Nagios server). Our status.dat file is 37MB, and objects.cache is 32MB. If you need more details than this, please let me know what you need.

In my tests (and they are *not* extensive), the web server would occasionally get incomplete results from the status.dat file. The likelihood of this increased if the service information was at the end of the status.dat file, though it was not exclusive to that position. After the implementation of this change, 300 reloads of the extinfo.cgi page for the entry of the service found at the end of the status.dat file completed every time. That seemed to be enough anecdotal information to post it as a possible patch. I'm glad to see that it was looked at and is being scrutinized.

I may be wrong about what follows, but I did my homework before trying to implement the fix on our system, and I'm going by what I found. The fsync() call is the more important function call in the fix. fclose() almost always guarantees an fflush(), but it doesn't guarantee that the data will be written to the disk immediately, especially if the program doesn't exit. fflush() hands the output to the OS, but the write to disk then happens at the OS level, meaning it may wait momentarily to do so. fsync() does incur a very slight performance hit, but it is not like sync() (which a user program should not call). fsync() has much less of an impact than sync(). Since *we* are reading the file across NFS, that may be the reason we are seeing the absence of file data. Since the data is written to a temporary file, then renamed to replace the previous version, there isn't much chance for the complete file not to be available.
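
The write sequence described above (write to a temporary file, flush, sync, close, rename) can be sketched roughly as follows. This is a minimal illustration and not the actual Nagios code or the attached patch: the helper name write_file_durably() and its parameters are made up for the example. One assumption worth stating: fsync() takes a file descriptor, and fclose() releases that descriptor, so the sketch calls fsync() while the stream is still open.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() under strict C modes */
#endif
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not a Nagios function): write `data` to `tmp_path`,
 * push it to stable storage, then rename it over `final_path`.
 * Returns 0 on success, -1 on failure. */
static int write_file_durably(const char *tmp_path, const char *final_path,
                              const char *data)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    if (fputs(data, fp) == EOF)
        goto fail;
    if (fflush(fp) != 0)            /* stdio buffer -> kernel */
        goto fail;
    if (fsync(fileno(fp)) != 0)     /* kernel cache -> disk, needs an open fd */
        goto fail;
    if (fclose(fp) != 0) {          /* stream is gone either way, do not reuse */
        unlink(tmp_path);
        return -1;
    }
    if (rename(tmp_path, final_path) != 0) {   /* replace the old file */
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    fclose(fp);
    unlink(tmp_path);
    return -1;
}
```

A caller would use it along the lines of write_file_durably("status.tmp", "status.dat", text), where both file names are placeholders. Whether the fsync() step is what actually cures the missing-data symptom over NFS is exactly the open question in this thread.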

Can you provide another explanation of why the status.cgi and extinfo.cgi programs fail to find the data for a host or service one second, but succeed a few seconds later, other than that status.dat and the related files do not yet contain the information? We would seriously like to fix this problem.

Thanks!

Cary

________________________________________
From: Gaspar, Carson [Carson.Gaspar at gs.com]
Sent: Thursday, August 06, 2009 1:18 PM
To: 'Nagios Developers List'
Subject: Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat

Really? This makes no sense at all.

All pending stdio output should be flushed by fclose(). If it isn't, your stdio is broken.

All pending disk writes will read back as if committed when read on the same host, without needing a very expensive fsync(). If it isn't, then your kernel / filesystem is broken.
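
The same-host claim above can be checked with a small sketch like the one below. It is only an illustration under stated assumptions: the function name readback_after_fclose() is hypothetical, and the check covers a writer and reader on one machine. It says nothing about a reader on a different host going through an NFS mount, which is the setup Cary describes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check (not part of Nagios): write `text` with stdio, close
 * the stream WITHOUT calling fsync(), reopen the file and compare.
 * Returns 1 if the text reads back intact, 0 if it does not,
 * -1 on an I/O error or if `text` does not fit the local buffer. */
static int readback_after_fclose(const char *path, const char *text)
{
    char buf[4096];
    size_t len = strlen(text);
    size_t got;
    FILE *fp;

    if (len >= sizeof buf)
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)            /* flushes the stdio buffer to the kernel */
        return -1;

    fp = fopen(path, "r");          /* same host: served from the page cache */
    if (fp == NULL)
        return -1;
    got = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);

    return (got == len && memcmp(buf, text, len) == 0) ? 1 : 0;
}
```

On a working local filesystem this is expected to return 1 every time, which is the behaviour the paragraph above relies on.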

Please do _not_ add this code change. If there's a real bug in Nagios, this doesn't fix it, just hides it. And if the bug is in the OS, working around it isn't the right answer (unless you want to add checks for brokenness to autoconf).

Cary, can you please provide details of the system on which you are experiencing the problem?

-----Original Message-----
From: Ethan Galstad [mailto:egalstad at nagios.org]
Sent: Friday, July 31, 2009 8:07 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat

Cary Petterborg wrote:
> Our status.dat file is about 37MB. We occasionally will find that
> valid services are not showing up from a status.cgi or extinfo.cgi
> page. This results in people getting confused or they know the
> problem and refresh the page to get the REAL data they need. Since
> the status.dat file is written to a temp file which is moved into
> place once the file is closed, it should not have partial contents.
> But, in our case at least, we were seeing results from the CGI's as
> if the file were only partially written. The problem with the current
> implementation is that it is possible that the file gets closed, but
> the contents are not completely flushed to disk when it is moved in to
> replace the old file. In testing this phenomenon I took a service
> from the end of the status.dat file and looked at a CGI page as
> quickly as I could for many iterations. I found that about every 30th
> time (my average) the page acted as if the service didn't exist.
>
> That seems to be quite a high number of instances for the page to
> fail, so I added an fflush() before the fclose() and an fsync() right
> after the fclose(). This virtually guarantees that the file is
> completely written before the temp file is moved in to replace the
> outdated file. After making the change I was never able to get a
> failed page in more than 200 iterations of viewing the same page.
>
> The other files that could be a problem (and for completeness' sake)
> are retention.dat, comments.dat and downtime.dat. So I applied the
> same principle change to each of these.
>
> I'm attaching a patch file that was done against our 2.7 version. I
> looked in the 3.0 code and it was not substantially different. The
> line numbers are different, though the context is the same, but the
> patch doesn't work on 3.0. I'm quite sure that a similar fix will
> work properly for 3.0.
>
> If anyone else is having this problem, you might want to try this
> patch and see if it fixes your problems as well. It is probably a
> good candidate for a bug fix if it is found to be a valuable
> modification. I don't know if smaller installations of Nagios are
> having any issues like this or not, but I suspect it is possible
> since actually flushing to the disk is handled by the OS on its own
> timetable unless forced with fsync().
>
> If you try this modification, please let me know of any issues you
> have.
>
> Cary Petterborg

Good patch - I'll get this applied to Nagios 3.x HEAD.

- Ethan Galstad

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel