Nagios 3.0.5 problem

Rick Mangus rick.mangus+nagios at gmail.com
Fri Jan 29 18:01:32 CET 2010


Hello, all.

Forgive me, I am new to the list, and have only begun working with nagios
recently.  I have searched this list and googled furiously with little
result, so I must cease my lurking and present my problem to you.

I will begin with the problem: Sometime after midnight every night, my
nagios server starts to have trouble processing service checks.  I don't
know the cause, and cannot find a solution.  I can describe the symptoms in
detail and hope we can diagnose it.

The web interface shows the last service check came in at 02:28:34 (EST).  I
know that around 04:15 every morning, xinetd starts refusing connections to
nsca due to high load (max_load is 18), and that eventually I will have
32000+ nsca connections using up all available PIDs, leading to an inability
to fork new processes and effectively killing the machine.  While all this
happens, nagios.log appears to stall periodically, making no new entries
for 15 minutes at a time and then flushing 15000 entries in the space of a
single second.  Also, the checkresults directory seems to be empty most of
the time, but sometimes jumps to 2045 files (it's on a ramdisk with 2048
inodes), and not a single one gets deleted in any period I have been patient
enough to observe.
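
The stall-then-flush pattern described above can be measured directly from
the log.  What follows is only a minimal Python sketch, assuming each
nagios.log line begins with a bracketed Unix epoch such as "[1264784492]".
The function names and the thresholds (600 seconds of silence, 1000 lines in
one second) are illustrative assumptions, not part of the original setup.

```python
import re
from collections import Counter

# Assumption: log lines start with a Unix epoch in square brackets,
# e.g. "[1264784492] SERVICE ALERT: ...". Other lines are skipped.
LINE_RE = re.compile(r"^\[(\d+)\]")


def parse_epochs(lines):
    """Return the epoch timestamp of every line that carries one."""
    epochs = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            epochs.append(int(match.group(1)))
    return epochs


def find_stalls(epochs, min_gap=600):
    """Return (last_before, first_after, gap_seconds) for each silence
    of at least min_gap seconds between consecutive log entries."""
    return [(a, b, b - a)
            for a, b in zip(epochs, epochs[1:])
            if b - a >= min_gap]


def find_bursts(epochs, min_lines=1000):
    """Return {epoch: line_count} for every second that holds at least
    min_lines log entries."""
    counts = Counter(epochs)
    return {t: n for t, n in counts.items() if n >= min_lines}
```

To use it, open the log file, pass the file object to parse_epochs(), and
feed the result to find_stalls() and find_bursts().  Matching the stall
windows against the time xinetd begins refusing nsca connections should show
whether the two line up.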

The periods in which the nagios log is going nowhere are accompanied by
nagios taking 100% of 2 CPUs.  One thread appears to poll() approximately
every 25 usecs, and another is inscrutable, with mprotect() the only
strace-visible syscall.  All the nsca processes are blocked in a write()
call.  Even when the log is showing new entries, no updates are made to the
services, and that seems to be what fills up checkresults.  I admit I have
not checked the order in which the log and checkresults are processed,
though I assumed they would happen in the opposite order of what this
appears to show.
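
One way to pin down that ordering is to sample the checkresults directory
at a fixed interval and compare the timestamps with the log.  This is a
rough Python sketch under stated assumptions: the directory path is whatever
check_result_path points to in nagios.cfg (the path in the usage note is
hypothetical), the host is Unix-like so os.statvfs() is available, and the
helper names are made up for illustration.

```python
import os
import time


def count_entries(path):
    """Count the entries directly under path (files and subdirectories)."""
    with os.scandir(path) as entries:
        return sum(1 for _ in entries)


def inode_usage(path):
    """Return (used, total) inodes for the filesystem that holds path."""
    st = os.statvfs(path)
    return st.f_files - st.f_ffree, st.f_files


def sample(path, interval=5.0, rounds=3):
    """Take `rounds` samples, `interval` seconds apart.

    Each row is (epoch, entry_count, inodes_used, inodes_total).
    """
    rows = []
    for i in range(rounds):
        used, total = inode_usage(path)
        rows.append((int(time.time()), count_entries(path), used, total))
        if i < rounds - 1:
            time.sleep(interval)
    return rows
```

For example, sample("/path/to/checkresults", interval=5.0, rounds=720) would
cover one hour.  If the entry count climbs toward the 2048-inode limit while
the log is still flushing entries, that would suggest the result files are
being written faster than they are reaped, rather than the log running ahead
of the results.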

I know this behavior has been ongoing for at least 1 month.  I have disabled
all cron jobs that I feared might be interfering.  I will answer any and all
questions to the best of my ability, and hope someone here can shed some
light on the situation.

--Rick