Multi-Threaded Nagios, The story so far...

Steven D. Morrey smorrey at ldschurch.org
Fri Sep 11 23:58:50 CEST 2009


Hi Everyone,

There is a law of the universe that says the more complex something is, the more complex its problems tend to be, because of the interactions between its disparate systems.
I imagine this is similar to chaos theory: a butterfly flapping its wings in Africa could trigger a hurricane in Florida, and so on.
In this case a change I made to speed things up worked a little too well, causing a whirlwind to take place.

As you know, a few weeks ago, after a lot of profiling, I decided to move the high priority event queue into its own thread.
This worked great for several days, but eventually I noticed that latency would again climb to unacceptable levels and fewer and fewer service checks were executing.

While looking at the log I realized that the results buffer was constantly overflowing, so I increased it, and increased it, and increased it some more, before eventually realizing that the problem wasn't the size of the buffer but that it simply wasn't emptying fast enough.
It dawned on me that part of the problem was that the service reaper is serialized and only has 10 seconds to complete, yet on average it takes almost 3 times longer to reap than to execute.
So I modified the reaper, removing the time limit, but then the reaper would obviously be the only high priority event ever to run, since the reaper could never keep up with the executor (dnx).
My final modification to the reaper was to have each reaper event launch into its own thread.
This created what is in effect a semi-self-managed pool of threads: if it takes more than X seconds for the first reaper to finish, a second reaper will launch, then X seconds later a third, then a fourth, and so on.
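
To make the reaper arrangement above concrete, here is a minimal sketch of the idea, not the actual patch. It assumes the result buffer can be modeled as a simple counter; struct result_buffer, reaper_thread and run_reaper_events are hypothetical names, and the "X seconds" timer is replaced by simply firing a fixed number of reaper events.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_REAPERS 16  /* arbitrary cap for this sketch */

/* Hypothetical stand-in for the check result buffer. */
struct result_buffer {
    pthread_mutex_t lock;
    int pending;  /* results waiting to be reaped */
    int reaped;   /* results processed so far */
};

/* One reaper: claim results one at a time until the buffer is empty. */
static void *reaper_thread(void *arg)
{
    struct result_buffer *buf = arg;

    for (;;) {
        pthread_mutex_lock(&buf->lock);
        if (buf->pending == 0) {
            pthread_mutex_unlock(&buf->lock);
            break;
        }
        buf->pending--;  /* claim one result */
        pthread_mutex_unlock(&buf->lock);

        /* ... the real work on the result would happen here, unlocked ... */

        pthread_mutex_lock(&buf->lock);
        buf->reaped++;
        pthread_mutex_unlock(&buf->lock);
    }
    return NULL;
}

/* Fire n_events reaper events, each in its own thread, against a buffer
 * holding n_results results. Returns the number of results reaped. */
int run_reaper_events(int n_results, int n_events)
{
    struct result_buffer buf;
    pthread_t tids[MAX_REAPERS];
    int i, started = 0;

    assert(n_results >= 0 && n_events >= 0);
    if (n_events > MAX_REAPERS)
        n_events = MAX_REAPERS;

    pthread_mutex_init(&buf.lock, NULL);
    buf.pending = n_results;
    buf.reaped = 0;

    for (i = 0; i < n_events; i++) {
        if (pthread_create(&tids[started], NULL, reaper_thread, &buf) == 0)
            started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&buf.lock);
    return buf.reaped;
}
```

Calling run_reaper_events(1000, 4) drains 1000 results with four concurrent reapers. The point of the sketch is that every reaper touches the shared buffer only while holding the lock, so adding more reapers changes throughput but not the result count.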

This design works fantastically, except that after the first pass through the system there would be a double free condition, and eventually, an hour or so later, no more checks would be executing.
The event list was empty, but the application didn't exit (in events.c if event_list_low == NULL the program should shut down).

My initial suspicion was that when the double free or corruption occurred, it would do so while holding a mutex, thereby preventing rescheduling from occurring on any events received.
While I never did find out which mutex was causing the problem, I was eventually able to verify this theory: the application crashed at some point but left 5 processes alive, and an strace -ff -p showed that not only was each process stuck waiting on a mutex, they were all waiting on the same mutex.
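
The failure mode described above, where a lock holder goes away and everyone else waits forever, can be reproduced in isolation. The following is only an illustrative sketch, assuming Linux with glibc and default (non-robust) mutexes; die_holding_lock and probe_abandoned_lock are hypothetical names, not Nagios functions. It uses pthread_mutex_timedlock so the probe gives up instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Simulates a code path that dies while still holding the mutex. */
static void *die_holding_lock(void *arg)
{
    pthread_mutex_t *m = arg;

    pthread_mutex_lock(m);
    return NULL;  /* thread ends without unlocking */
}

/* Try to take a mutex abandoned by a finished thread, waiting at most
 * timeout_ms. Returns the pthread_mutex_timedlock result (ETIMEDOUT when
 * the lock is stuck), or -1 if the helper thread could not be started. */
int probe_abandoned_lock(long timeout_ms)
{
    pthread_mutex_t m;
    pthread_t tid;
    struct timespec deadline;
    int rc;

    assert(timeout_ms >= 0);
    pthread_mutex_init(&m, NULL);
    if (pthread_create(&tid, NULL, die_holding_lock, &m) != 0)
        return -1;
    pthread_join(tid, NULL);  /* the owner is gone, the lock is still held */

    /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    rc = pthread_mutex_timedlock(&m, &deadline);
    if (rc == 0)
        pthread_mutex_unlock(&m);
    /* The mutex is deliberately not destroyed: it may still be locked. */
    return rc;
}
```

A plain pthread_mutex_lock in place of the timed variant would block indefinitely here, which matches what strace shows when every surviving process is parked on the same futex. Swapping the timed call into suspect code paths temporarily is one way to find out which lock is the stuck one.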

It's always occurring in the free_memory function, so I tried to control access to that function via a mutex; however, the double free condition continued to occur anyway, so I took a different tack.
Noticing that the double free or corruption issue was occurring on a regular basis in a custom function I created to allow Nagios 2.7 to run host checks concurrently with service checks instead of blocking on them, I went ahead and commented out the problem code and reverted the host check behavior back to normal. It now seems to be functionally OK, even though we are still sporadically getting double free conditions.
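
One general reason a mutex around the freeing function may not be enough on its own: the lock only serializes the calls, it does not stop the same pointer from being freed twice, one call after the other. A common defensive idiom is to free through the address of the owning pointer and set it to NULL under the same lock. The sketch below shows that idiom only; free_once and free_lock are hypothetical names, and whether this matches the actual cause in free_memory is not established.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical lock guarding all frees done through free_once(). */
static pthread_mutex_t free_lock = PTHREAD_MUTEX_INITIALIZER;

/* Free *slot at most once. Returns 1 if memory was released, 0 if the
 * slot was already NULL (a second free was attempted and skipped). */
int free_once(void **slot)
{
    int freed = 0;

    assert(slot != NULL);
    pthread_mutex_lock(&free_lock);
    if (*slot != NULL) {
        free(*slot);
        *slot = NULL;  /* later callers see NULL instead of a stale pointer */
        freed = 1;
    }
    pthread_mutex_unlock(&free_lock);
    return freed;
}
```

This only helps when every thread reaches the memory through the same slot. If two structures each hold their own copy of the pointer, clearing one copy leaves the other dangling, and the double free comes back no matter how the function itself is locked.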

I'm going to look into removing the free_memory event altogether and instead having it fire periodically as a high priority event, since it looks to me like it's basically just a garbage collection step anyway.
I'm hoping someone could let me know if I'm on the right track or not.

My numbers aren't quite as good now: it's a 25% drop from the top numbers I was able to get by parallelizing host checks, but it's still working 200-300% faster than it was before I began the work to multi-thread it.
I'm confident that at some point in the future I'll be able to parallelize host checks again, but since 3.x has that already I'm not wasting any more resources on it; we have a roadmap in place to upgrade in the near future anyway.

Assuming it runs stably, I'll get another patch out in a few more days for testing.

Sincerely,
Steve








