(Multi-Threaded Nagios results buffer overflowing) A good problem to have?

Steven D. Morrey smorrey at ldschurch.org
Wed Sep 9 19:40:40 CEST 2009


Hi Everyone,

After creating the multi-thread patch for Nagios, I noticed that after a few hours performance would quickly begin to degrade: fewer and fewer service checks were executing, yet latency remained the same.
When I went looking for the problem, I found that the service result buffer would fill quickly and then overflow constantly.
Used/High/Total Check Result Buffers: 4096 / 4096 / 4096

So I doubled the size of the buffer. It helped, but eventually it would overflow again.
Eventually it struck me to tie the size of the buffer to the total number of service checks, so I set it to 30,000.  That worked very well, but after a day or so it would overflow again all the same.

Used/High/Total Check Result Buffers: 30000 / 30000 / 30000

Then I had an epiphany.  The problem isn't that the buffer is too small; the overflow is really only a symptom.
The actual problem is that the service reaper is too slow.  
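
To see why a bigger buffer only postpones the overflow: if results arrive faster than the reaper drains them, the buffer fills after roughly capacity / (arrival rate - drain rate) seconds, whatever its size. The helper below is a minimal sketch of that arithmetic. The function name and the rates used with it are made up for illustration and are not measurements from this system.

```c
/* Seconds until a buffer of `capacity` slots is full, given results
 * arriving at `in_per_sec` and the reaper draining at `out_per_sec`.
 * Returns -1.0 if the reaper keeps up and the buffer never fills.
 * (Hypothetical helper for illustration; not part of Nagios.) */
double seconds_until_full(double capacity, double in_per_sec, double out_per_sec)
{
    double net = in_per_sec - out_per_sec;   /* net growth per second */

    if (net <= 0.0)
        return -1.0;                         /* reaper is fast enough */
    return capacity / net;
}
```

With an assumed surplus of one result per second, going from 4096 to 30000 slots only moves the overflow from 4096 seconds to 30000 seconds after start. Only a faster reaper changes the outcome.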

Profiling shows that the system spends two-thirds of its time just running the reaper and only one-third actually executing checks.
When I moved the high priority events into their own thread, the reaper stopped blocking the system, but it still needed that time to actually empty the results buffer.
So I removed the timeout that makes the reaper bail out after so many seconds have passed, to give it as much time as it needed. That helped, but it still never came close to catching up.

My final solution was to create a thread in handle_timed_event just for the service reaper.
The reaper runs infrequently enough that I don't think thread creation overhead will be a significant issue. What it does do is allow more threads, and therefore more resources, to be devoted to the service reaper when they are needed; when the results buffer empties, the threads can exit, freeing resources for other tasks.
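
The approach can be sketched roughly as follows. This is not the actual patch: the ring buffer, `push_result`, `reaper_thread` and `spawn_reaper` are hypothetical stand-ins, and a real reaper would do its result processing where the comment indicates. The sketch only shows the shape of the idea: each reaper event starts a thread that drains the shared buffer and exits once it is empty.

```c
#include <pthread.h>
#include <stddef.h>

#define RESULT_SLOTS 1024            /* hypothetical buffer size */

/* Hypothetical stand-in for the check result buffer:
 * a fixed-size ring protected by one mutex. */
static int ring[RESULT_SLOTS];
static size_t head, tail, used;
static long reaped;                  /* results taken out so far */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer side: returns 0 on success, -1 if the buffer is full. */
int push_result(int r)
{
    int rc = -1;

    pthread_mutex_lock(&lock);
    if (used < RESULT_SLOTS) {
        ring[tail] = r;
        tail = (tail + 1) % RESULT_SLOTS;
        used++;
        rc = 0;
    }
    pthread_mutex_unlock(&lock);
    return rc;
}

/* Reaper thread body: take results until the buffer is empty, then exit. */
static void *reaper_thread(void *arg)
{
    (void)arg;
    for (;;) {
        int r;

        pthread_mutex_lock(&lock);
        if (used == 0) {
            pthread_mutex_unlock(&lock);
            return NULL;             /* nothing left: free the thread */
        }
        r = ring[head];
        head = (head + 1) % RESULT_SLOTS;
        used--;
        reaped++;
        pthread_mutex_unlock(&lock);

        (void)r;  /* a real reaper would process the result here, outside the lock */
    }
}

/* Called where the reaper event fires: start one more reaper thread.
 * Returns 0 on success, or the pthread_create error code. */
int spawn_reaper(pthread_t *tid)
{
    return pthread_create(tid, NULL, reaper_thread, NULL);
}
```

The caller can either `pthread_join` the returned thread or `pthread_detach` it and carry on. Because several reapers may run at once, each one only holds the lock long enough to remove a single result.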

The proof is in the pudding.
Used/High/Total Check Result Buffers: 290 / 706 / 30000

As you can see, I'm no longer hitting the high-water mark.

It produces an interesting pattern when running under gdb:

[New Thread -1542456416 (LWP 18647)]
[New Thread -1550845024 (LWP 18675)]
[New Thread -1559233632 (LWP 18736)]
[New Thread -1567622240 (LWP 18869)]
[New Thread -1576010848 (LWP 19020)]
[New Thread -1584399456 (LWP 19129)]
[New Thread -1592788064 (LWP 19224)]
[New Thread -1601176672 (LWP 19319)]
[New Thread -1609565280 (LWP 19434)]
[Thread -1601176672 (LWP 19319) exited]
[Thread -1609565280 (LWP 19434) exited]
[Thread -1592788064 (LWP 19224) exited]
[Thread -1542456416 (LWP 18647) exited]
[Thread -1584399456 (LWP 19129) exited]
[Thread -1567622240 (LWP 18869) exited]
[Thread -1550845024 (LWP 18675) exited]
[Thread -1576010848 (LWP 19020) exited]


As you can see, the threads appear to be interleaving. For instance, even though 18647 is the first one to launch, it's the fourth one to exit.

Also, the number of threads launched is never consistent. I've seen every number from 1 to as many as 20, but I'm sure there is no upper bound, so death by threads is entirely possible here, if unlikely.
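
One way to rule out death by threads would be to cap the number of reapers running at once. The following is a minimal sketch under that assumption, not something the patch currently does: `MAX_REAPERS`, `reaper_slot_acquire` and `reaper_slot_release` are hypothetical names, and the cap of 8 is arbitrary.

```c
#include <pthread.h>

#define MAX_REAPERS 8                /* hypothetical cap; tune for the host */

static int active_reapers;           /* reaper threads currently running */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Try to reserve a slot before starting a reaper thread.
 * Returns 1 if the caller may start one, 0 if the cap is reached. */
int reaper_slot_acquire(void)
{
    int ok = 0;

    pthread_mutex_lock(&count_lock);
    if (active_reapers < MAX_REAPERS) {
        active_reapers++;
        ok = 1;
    }
    pthread_mutex_unlock(&count_lock);
    return ok;
}

/* Give the slot back; a reaper thread would call this just before it exits. */
void reaper_slot_release(void)
{
    pthread_mutex_lock(&count_lock);
    if (active_reapers > 0)
        active_reapers--;
    pthread_mutex_unlock(&count_lock);
}
```

When `reaper_slot_acquire` returns 0, the event handler would simply skip spawning; the reapers already running keep draining the buffer, so no results are lost by skipping.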

I'd appreciate any thoughts you may have on the matter, and maybe some encouragement, advice, and stern words of warning if anyone has been down this path before.  
If I'm treading uncharted waters in undiscovered lands, I'd like to know that as well :)

Sincerely,
Steve




