Nagios 3.0.5 problem

Rick Mangus rick.mangus+nagios at gmail.com
Wed Feb 3 21:49:37 CET 2010


Well, I have more information to add.

I found a script that was being launched at midnight to purge old data from
the database.  The tables being pruned are used by perfparse to store
perfdata and the like.  They have > 180M rows, are 30-60GB, and are actively
being inserted into all the while.  As I understand it, they are InnoDB and
should be using row (not table) locks, and really should not have much
trouble with concurrent inserts.  While this goes on, one CPU/core is
largely in iowait, but the other 7 are largely idle, and we generally don't
have any trouble with RAM or other resource exhaustion.

Now that I know what caused my problem, I can reproduce it, which is ...
interesting.  After only a few minutes, nagios starts falling behind on
service checks.  It appears to be getting new checks with current timestamps
in the nagios.log, but a service detail sorted by "Last Check" descending
slowly shows the timestamps getting further and further behind current.  A
bit later, nagios starts taking 100% of 2 CPU cores, and nsca processes
start to stack up...  leading to the problem as I was observing it in the
morning.
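
One way to put a number on that lag, rather than watching the "Last Check" column, is to read the last_check timestamps from the Nagios status file. What follows is a minimal sketch, not something from the original setup: it assumes the status file uses the usual servicestatus { ... } blocks with host_name=, service_description= and last_check= (epoch seconds) fields, and the function name stale_services and the 900-second threshold are made up for illustration.

```python
import re
import time


def stale_services(status_text, now=None, max_age=900):
    """Return (host, service, age_seconds) for services whose last check is old.

    status_text is the content of a Nagios status file; the field names used
    here are assumed to match your installation.
    """
    now = time.time() if now is None else now
    stale = []
    # Each service lives in a "servicestatus { ... }" block of key=value lines.
    for block in re.findall(r"servicestatus\s*\{(.*?)\n\s*\}", status_text, re.S):
        fields = dict(
            line.strip().split("=", 1)
            for line in block.splitlines()
            if "=" in line
        )
        age = now - int(fields.get("last_check", 0))
        if age > max_age:
            stale.append(
                (fields.get("host_name"), fields.get("service_description"), int(age))
            )
    # Oldest checks first, so the worst lag is at the top.
    return sorted(stale, key=lambda item: item[2], reverse=True)
```

Running this every minute against the status file while the delete runs, and logging the length of the result, would show when the lag starts and whether it ever recovers.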

In an attempt to diagnose I tried a few things.  I have found that by the
time nagios starts to bug out it can't be saved.  If you cancel the delete
query after seeing a lag on the check results, nagios does not slowly
improve and 'catch up' as I had hoped.  The lag appears even if the query
matches no rows to delete, though not if you use LIMIT to keep the query to
a reasonable timeframe.

I'm still looking for fresh ideas, but in the meantime I am writing a script
to loop over the delete and do it in 10,000-row increments, which take ~10
seconds each, instead of deleting ~3M rows at once, which takes over an hour
per table.  If you do the math (300 batches at ~10 seconds is about 50
minutes), you'll see it'll be nearly as time-consuming, and I'm just hoping
that we'll lock whatever is going on for a shorter period, with room for
inserts to happen in between.  Even if that 'fixes' it, I won't be
satisfied.
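
For reference, the batching loop described above could look roughly like the sketch below. It is not the actual script. The function name purge_in_batches, the pause between batches, and the delete_batch callable are all assumptions for illustration. The callable is expected to run one bounded delete (for MySQL, something like DELETE ... WHERE ... LIMIT n), commit it, and return the number of rows removed.

```python
import time
from typing import Callable


def purge_in_batches(
    delete_batch: Callable[[int], int],
    batch_size: int = 10_000,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run bounded deletes until a batch removes fewer rows than requested.

    delete_batch(limit) must delete at most `limit` rows, commit, and return
    the number of rows it removed. Returns the total number of rows deleted.
    """
    total = 0
    while True:
        deleted = delete_batch(batch_size)
        total += deleted
        if deleted < batch_size:
            # A short batch means nothing older than the cutoff is left.
            return total
        # Pause so concurrent inserts get a window between deletes.
        sleep(pause_seconds)


# Hypothetical MySQL callable, given a DB-API connection `conn` and a cutoff
# value; the table and column names are placeholders, not the real schema:
#
# def delete_old_rows(limit):
#     cur = conn.cursor()
#     cur.execute(
#         "DELETE FROM perfdata_table WHERE ctime < %s LIMIT %s",
#         (cutoff, limit),
#     )
#     conn.commit()
#     return cur.rowcount
```

Committing after each batch is the point of the exercise: each delete then holds its locks for seconds rather than an hour, even though the total run time stays about the same.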

Any and all suggestions are welcomed.

--Rick

On Fri, Jan 29, 2010 at 11:01 AM, Rick Mangus
<rick.mangus+nagios at gmail.com> wrote:

> Hello, all.
>
> Forgive me, I am new to the list, and have only begun working with nagios
> recently.  I have searched this list and googled furiously with little
> result, so must cease my lurking and present my problem to you.
>
> I will begin with the problem: Sometime after midnight every night, my
> nagios server starts to have trouble processing service checks.  I don't
> know the cause, and cannot find a solution.  I can describe the symptoms in
> detail and hope we can diagnose it.
>
> The web interface shows the last service check came in at 02:28:34 (EST).
> I know that around 4:15 every morning, xinetd starts refusing connections to
> nsca due to high load (max_load is 18), and that eventually I will have
> 32000+ nsca connections using up all available PIDs leading to an inability
> to fork new processes, effectively killing the machine.  While all this
> happens, the nagios.log appears to periodically stall, making no new entries
> for 15 minutes at a time, and then flush 15000 in the space of a single
> second.  Also, it seems the checkresults directory is empty most of the
> time, but sometimes pops up to 2045 files (it's on a ramdisk with 2048
> inodes) and not a single one gets deleted in a time period I have been
> patient enough to observe.
>
> The periods in which the nagios log is going nowhere are accompanied by
> nagios taking 100% of 2 CPUs.  One thread appears to poll() approximately
> every 25 usecs, and another is inscrutable, with mprotect() the only
> strace-visible syscall.  All the nsca processes have a blocking write() they
> are waiting on.  When the log is showing new entries, there are still no
> updates made to the services, and it seems that that is what is filling up
> checkresults.  I admit I have not checked to find the order of the log and
> checkresults processes, though I assumed they would operate in the opposite
> order of what this appears to show.
>
> I know this behavior has been ongoing for at least 1 month.  I have
> disabled all cron jobs that I feared might be interfering.  I will answer
> any and all questions to the best of my ability, and hope someone here can
> shed some light on the situation.
>
> --Rick
>