Continuing issues with retention file causing schedule/actions to be ignored.

Eli Stair estair at ilm.com
Thu Mar 9 21:23:15 CET 2006


Here I go multitasking, file attached.  I've also attached a day's worth 
of "premature end of script headers" errors from the Apache logs.  
Here's an example of a view in extinfo.cgi that was working one 
minute, and then after a "refresh" it errors out:

Loading this URL:
   https://monitor02/nagios/cgi-bin/extinfo.cgi?type=1&host=deathstar1258

Results in this error (momentarily):
   [Thu Mar 09 12:08:59 2006] [error] [client 10.73.16.108] Premature end of script headers: extinfo.cgi, referer: https://monitor02/nagios/cgi-bin/status.cgi?hostgroup=all&style=hostdetail&hoststatustypes=4&hostprops=42
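
To see whether these errors cluster on particular CGIs or times of day, 
the attached log could be tallied with something like the sketch below.  
It assumes each entry sits on a single line in the real Apache error log 
(the example above is only wrapped by the mail client) and has the same 
shape as that example; the function name tally_premature_headers is 
made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the single example entry quoted above:
# [timestamp] [error] [client IP] Premature end of script headers: <script>, referer: ...
PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[error\] \[client (?P<client>[^\]]+)\] "
    r"Premature end of script headers: (?P<script>[^\s,]+)"
)


def tally_premature_headers(lines):
    """Count matching errors per CGI script and per hour of day."""
    per_script = Counter()
    per_hour = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a premature-header entry
        per_script[match.group("script")] += 1
        # Timestamp looks like 'Thu Mar 09 12:08:59 2006'; field 3 is HH:MM:SS.
        hour = match.group("ts").split()[3].split(":")[0]
        per_hour[hour] += 1
    return per_script, per_hour
```

Passing an open file handle for the error log as `lines` and printing 
`per_script.most_common()` would show which CGIs fail most often.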

I still haven't been able to get any indication of the cause (or even 
the existence) of the scheduling/event stalling issues.  Nothing ever 
appears "incorrect" in nagios' logs or schedule; the only symptom is 
that events stop occurring.  One more item I noticed after I removed the 
retention.dat file yesterday:  in addition to event handlers for one 
service not being executed, there was one user whose acknowledgements 
did not trigger "acknowledgement" emails even though they should have, 
while my acks did send an email.  After the file removal, that problem 
went away as well.  In practice it can take several weeks to a month+ 
of running before I notice the issue cropping up again.  In that time I 
add/remove hundreds (thousands) of hosts/services, and reload and 
stop/start nagios dozens of times...
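
One thing that may be worth ruling out before the file is moved aside 
next time is whether it retains any disabled per-object flags.  The 
sketch below is only a rough illustration and rests on assumptions not 
stated in this message: that retention.dat consists of "type {" ... "}" 
blocks of key=value lines, and that the field names in FLAGS 
(active_checks_enabled, event_handler_enabled, notifications_enabled) 
match what this Nagios version writes.  The function names are 
hypothetical.

```python
# Assumed field names; adjust to whatever the retention file actually contains.
FLAGS = ("active_checks_enabled", "event_handler_enabled", "notifications_enabled")


def parse_retention(text):
    """Parse 'type { key=value ... }' blocks into (type, fields) pairs."""
    blocks = []
    current_type, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current is None:
            if line.endswith("{"):
                current_type = line[:-1].strip()
                current = {}
        elif line == "}":
            blocks.append((current_type, current))
            current_type, current = None, None
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


def disabled_flags(blocks):
    """List (type, host, service, flag) for each retained flag set to 0."""
    hits = []
    for kind, fields in blocks:
        if kind not in ("host", "service"):
            continue
        for flag in FLAGS:
            if fields.get(flag) == "0":
                hits.append((kind, fields.get("host_name"),
                             fields.get("service_description"), flag))
    return hits
```

Running this over a copy of a "bad" retention file and comparing the 
result with the objects that stopped being checked would show whether 
the stall is simply retained state or something less visible.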

Are there any potential fixes for these behaviours in CVS?  I haven't 
seen them addressed at all in -devel, although there have been a few 
reports of similar issues.

(Nagios 2.0, x86_64,
7385 services.
754 hosts.
6454 service dependencies.
47 commands.
)


/eli


Eli Stair wrote:
> 
> I'm continuing to have problems when the retention.dat file gets into a 
> state where the nagios process stops functioning properly.  The problems 
> I've had in the past were increasing numbers of hosts or entire 
> hostgroups no longer executing their service checks, and now (today) 
> the event handler for one particular service has stopped being executed 
> (while all others continue to work).
> 
> In this and all previous cases, stopping nagios and moving the retention 
> file out of the way resolves the issue.  Reloading or a hard stop/start 
> of nagios doesn't have any effect.  There has never appeared to be 
> anything "wrong" with the retention file.
> 
> The only issues with my installation are this issue, and the 
> all-too-frequent "premature end of script headers" in all the CGI's, and 
> "Warning: Size of service_message struct (528 bytes) is > 
> POSIX-guaranteed atomic write size (512 bytes). " due to compiling 
> x86_64.  That being said, I have enough issues that there are dozens of 
> daily "premature script header/Internal Server Error" errors wreaking havoc 
> with production, and these instances of event failures that are 
> extremely critical.  The script header problem came into being 
> immediately upon upgrading from 2.0b6 to 2.0rc2+, and the 
> scheduling/retention problem has been present to varying degrees in 
> every 2.0b+ I've tried.
> 
> I am happy to find these are configuration/optimization issues on my end 
> I can resolve, but my suspicion is they are bugs.  I will do anything I 
> can to help provide a debug testbed for identifying and tracking them 
> down.  Attached is my main nagios config (objects are not included), and 
> I can provide any other data (object configs, logs, retention.dat, etc) 
> privately if needed (security concerns).
> 
> Please let me know what I can do to help address this and find a 
> resolution.
> 
> Regards,
> 
> /eli
> 
> 

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagios.cfg
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20060309/fb61431e/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagios.script_header_errors
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20060309/fb61431e/attachment-0001.ksh>

