Well, I have more information to add. I found a script that was being launched at midnight to purge old data from the database. The tables being pruned are used by perfparse to store perfdata and the like. They have > 180M rows, are 30-60GB each, and are actively being inserted into the whole time. As I understand it, they are InnoDB and should be using row (not table) locks, so they really should not have much trouble with concurrent inserts. While the purge runs, one CPU/core is largely in iowait, but the other 7 are largely idle, and we generally don't have any trouble with RAM or other resource exhaustion.

Now that I know what caused my problem, I can reproduce it, which is ... interesting. After only a few minutes, nagios starts falling behind on service checks. It appears to be getting new checks with current timestamps in the nagios.log, but a service detail sorted by "Last Check" descending shows the timestamps slowly falling further and further behind the current time. A bit later, nagios starts taking 100% of 2 CPU cores, and nsca processes start to stack up... leading to the problem as I was observing it in the morning.

In an attempt to diagnose this I tried a few things. I have found that by the time nagios starts to bug out, it can't be saved. If you cancel the delete query after seeing a lag in the check results, nagios does not slowly improve and 'catch up' as I had hoped. This happens even if there are no rows to be deleted, though not if you use LIMIT to keep the query to a reasonable timeframe.

I'm still looking for fresh ideas, but in the meantime I am writing a script to loop over the delete and do it in 10,000-row increments, which take ~10 seconds each, instead of ~3M rows at once, which takes over an hour per table. If you do the math, though, you'll see it will be nearly as time-consuming overall, and I'm just hoping that we'll lock whatever is being locked for shorter periods, with room for inserts to happen in between. Even if that 'fixes' it, I won't be satisfied.
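A minimal sketch of the kind of batched delete loop described above, not the actual script. It assumes a Python DB-API connection and a caller-supplied DELETE statement that already carries its own row limit. The function name `purge_in_batches`, the table `perfdata` and the column `ts` are hypothetical, and the demo uses an in-memory SQLite database only so the loop can be run anywhere. With a MySQL driver the statement would presumably look more like `DELETE FROM <table> WHERE <time column> < %s LIMIT %s`.

```python
import sqlite3
import time


def purge_in_batches(conn, delete_sql, params, pause_seconds=1.0, max_batches=None):
    """Run a row-limited DELETE repeatedly until it removes nothing.

    delete_sql must contain its own row limit. Each batch is committed
    separately so that locks are released between batches.
    Returns the total number of rows deleted.
    """
    total = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        cur = conn.cursor()
        cur.execute(delete_sql, params)
        deleted = cur.rowcount
        conn.commit()  # end the transaction before the next batch
        cur.close()
        if deleted <= 0:
            break  # nothing left that matches the cutoff
        total += deleted
        batches += 1
        time.sleep(pause_seconds)  # leave room for concurrent inserts
    return total


def demo():
    """Exercise the loop against a throwaway SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE perfdata (ts INTEGER)")
    # 25 'old' rows (ts = 0) and 5 'recent' rows (ts = 100).
    conn.executemany("INSERT INTO perfdata (ts) VALUES (?)", [(0,)] * 25 + [(100,)] * 5)
    conn.commit()
    # SQLite lacks DELETE ... LIMIT by default, so limit through a subquery.
    sql = ("DELETE FROM perfdata WHERE rowid IN "
           "(SELECT rowid FROM perfdata WHERE ts < ? LIMIT ?)")
    deleted = purge_in_batches(conn, sql, (50, 10), pause_seconds=0)
    remaining = conn.execute("SELECT COUNT(*) FROM perfdata").fetchone()[0]
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

Using the figures above, ~3M rows at 10,000 rows per batch is about 300 batches. At ~10 seconds each that is roughly 50 minutes of delete time per table before any pauses are added, so the total run time is similar and only the length of each individual transaction shrinks.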
Any and all suggestions are welcome.

--Rick

On Fri, Jan 29, 2010 at 11:01 AM, Rick Mangus <rick.mangus+nagios@gmail.com> wrote:

> Hello, all. Forgive me, I am new to the list, and have only begun working with nagios recently. I have searched this list and googled furiously with little result, so must cease my lurking and present my problem to you.
>
> I will begin with the problem: Sometime after midnight every night, my nagios server starts to have trouble processing service checks. I don't know the cause, and cannot find a solution. I can describe the symptoms in detail and hope we can diagnose it.
>
> The web interface shows the last service check came in at 02:28:34 (EST). I know that around 4:15 every morning, xinetd starts refusing connections to nsca due to high load (max_load is 18), and that eventually I will have 32000+ nsca connections using up all available PIDs, leading to an inability to fork new processes and effectively killing the machine.
>
> While all this happens, the nagios.log appears to periodically stall, making no new entries for 15 minutes at a time, and then flush 15000 entries in the space of a single second. Also, it seems the checkresults directory is empty most of the time, but sometimes pops up to 2045 files (it's on a ramdisk with 2048 inodes), and not a single one gets deleted in any time period I have been patient enough to observe.
>
> The periods in which the nagios log is going nowhere are accompanied by nagios taking 100% of 2 CPUs. One thread appears to poll() approximately every 25 usecs, and another is inscrutable, with mprotect() the only strace-visible syscall. All the nsca processes have a blocking write() they are waiting on.
>
> When the log is showing new entries, there are still no updates made to the services, and it seems that that is what is filling up checkresults. I admit I have not checked the order in which the log and checkresults are processed, though I assumed they would operate in the opposite order of what this appears to show.
>
> I know this behavior has been ongoing for at least 1 month. I have disabled all cron jobs that I feared might be interfering. I will answer any and all questions to the best of my ability, and hope someone here can shed some light on the situation.
>
> --Rick