Sorry for replying to myself. I sent that just a little too soon.<br><br><div class="gmail_quote">On Tue, Aug 16, 2011 at 5:26 PM, Adam Augustine <span dir="ltr"><<a href="mailto:augustineas@gmail.com">augustineas@gmail.com</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="gmail_quote"><div class="im">On Mon, Aug 15, 2011 at 7:25 AM, Andreas Ericsson <span dir="ltr"><<a href="mailto:ae@op5.se" target="_blank">ae@op5.se</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div>On 08/09/2011 09:13 PM, Adam Augustine wrote:<br> ><br> > But in spite of that, it seems that moving the reaper code into a thread<br> > would be generically useful for Nagios. I know it has been discussed on this<br> > list in the past.<br> ><br> <br> </div>It would also cause a bunch of problems. What we're working on instead is<br> implementing worker processes which communicate with a master process via<br> a unix socket. One such process could act as a (mostly dormant) reaper for<br> the checkresult files in the spool directory.<br> <div><br></div></blockquote></div><div><br>Ah, it seems the scope of the worker process socket effort is much larger than I had expected. Does this mean that modules that were initially NEBs can instead be implemented as wholly independent processes, communicating back over that socket (presumably more than just a unix domain socket, but also a network socket as well)?<br> <br><br></div></div></blockquote><div><br>Here I meant that generically. The context of the original thread on the dev list regarding the socket communication to the master process led me to believe that it was specifically about offloading checks (à la DNX and mod_gearman). 
The question I am asking is whether all NEB callbacks would be implemented over the socket communication in the future.<br> <br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="gmail_quote"><div> </div><div class="im"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div> > If the Merlin reaper thread is wholly contained within the Merlin NEB (as it<br> > appears to be) and is not in any way patching the Nagios core code, then my<br> > question is, how is that working without conflicting with the main event<br> > loop reaper code?<br> <br> </div>Mainly by making Nagios itself threadsafe all API's the broker module uses.<br> That's why Merlin needs Nagios 3.3.1 or one of the post-3.2.3 versions made<br> available through <a href="http://git.op5.org" target="_blank">git.op5.org</a><br> <div><br></div></blockquote></div><div><br>Ah, so there are modifications necessary to pre-3.3.1 versions of Nagios to override the reaping process. Nagios 3.3.1 now has real (and threadsafe) APIs for manipulating internal data structures, where before there weren't any. This makes perfect sense to me. The Merlin reaper thread uses the same API to update the in-memory data structures that the main event loop reaper code would, so no conflicts.<br> </div><div class="im"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div> > My quick glance at the NEB callbacks for<br> > EVENT_CHECK_REAPER seems to indicate that there isn't any<br> > NEBERROR_CALLBACKOVERRIDE associated with it. So I am very curious how it is<br> > being handled.<br> ><br> <br> </div>You're talking about two different reapers. 
They don't interfere with<br> each other at all.<br> <font color="#888888"><br> --<br> Andreas Ericsson <a href="mailto:andreas.ericsson@op5.se" target="_blank">andreas.ericsson@op5.se</a></font><br></blockquote></div></div><br>I think I understand now, presuming that the Merlin reaper and the main Nagios event loop reaper are both using the new thread-safe APIs.<br> <br>But I am still a little confused. You mention above that implementing the reaper code as a Nagios thread would cause a lot of problems, but isn't that what the Merlin NEB module does? Are you encountering a lot of problems with that approach? Or was it specifically the /moving/ of the reaper into a thread that you thought was a bad idea?<br> <br>I certainly agree that socket communication provides a much cleaner separation, and would make things easier, and I am not advocating <br><br></blockquote><div>sticking with the NEB callback model.<br><br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> But separate thread or separate process is really an implementation detail (an important one, admittedly, but still).<br> <br>My base assumption is that the single-threaded Nagios core is slowed significantly by the time spent in the reaping portion of the loop. Evidence supporting that assumption is the fact that we have a timeout associated with that portion of code. Assuming the default of 30 seconds is "sane", the reaper could spend up to 30 seconds blocking checks from being executed, significantly impacting check_latency.<br> <br>Anyway, for a larger number of checks (50k-100k), I would think a reaper implemented as a worker process (or thread, or whatever) would be very busy processing all the results coming into the checkresult files in the spool directory and updating the relevant in-memory data structures. 
But based on your statement above (the "mostly dormant" part), it would seem that I am wrong somewhere.<br><br>What am I missing?<br><br>Thanks for your time in answering my questions. I have spent some time looking through the code and usually end up with more questions than answers on the internals of how Nagios is handling things.<br> <br> </blockquote></div><br>