Alleviating Nagios I/O contention problem

Max perldork at webwizarddesign.com
Sat Sep 25 17:53:16 CEST 2010


I like the suggestions Matthias makes; they have worked well for us.

RRD updates are very expensive.  Even without knowing anything more
about your system, I am fairly sure the RRD writes are causing most of
the I/O load.
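
If you want to confirm that before ordering hardware, per-process I/O
accounting will show where the writes come from.  A quick sketch
(sysstat's pidstat needs a kernel with per-task I/O accounting, 2.6.20
or later; the intervals are just examples):

  # kB read/written per process, sampled every 5 seconds -- compare
  # process_perfdata.pl / npcd / rrdcached against the nagios daemon
  pidstat -d 5

  # per-device latency and utilization, to see which disk is queuing
  iostat -x 5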

Our current largest Nagios-based system has around 7,500 hosts and
around 40k active services spread across four pollers.  The four
pollers send perfdata to two report servers that do nothing but host
the databases for traps from SNMPTT on the pollers, the RRD files and
PNP web UI, and the server side of our client/server notification
system.  The SNMPTT and notification-server databases are replicated
master-master between the two hosts.  Even with rrdcached and RAID 10,
these hosts regularly see 3-10% I/O wait.

We hope to lower that number a bit by moving the DBs onto separate
dedicated DB hosts.

- Max


On 9/25/10, Matthias Flacke <matthias.flacke at gmx.de> wrote:
> On 9/25/10 2:30 PM, Frost, Mark {PBC} wrote:
>> Greetings, listers,
>>
>> We've got an ongoing issue with I/O contention.  There's the obvious
>> problem that we've got a whole lot of things all writing to the same
>> partition.  In this case, there's just one big chunk of RAID 5 disk on
>> a single controller, so I don't believe that making more partitions is
>> going to help.
>>
>> On this same partition we have:
>>
>> 1) Nagios 3.2.1 running as the central/reporting server for a couple of
>> other Nagios nodes that are sending check results via NSCA.
>> Approximately 6-7K checks.
>>
>> 2) pnp4nagios 0.6.2 (with rrd 1.4.2) writing graph data.
>>
>> A second server, configured identically to the first, acts as a "hot
>> spare": it also receives check data from the two distributed nodes and
>> writes its own copy of the graph data locally.
>>
>> At the moment I'm concerned about the graph data, but because I can
>> only see I/O utilization as an aggregate, I can't tell which is the
>> worst component on that filesystem -- status.dat updates?  graph data?
>> writes to the var/spool directory?  We also expect continued growth,
>> so this is only going to get worse.
>>
>> These systems are quite lightly loaded from a CPU (2 dual-core CPUs)
>> and memory (4GB) perspective, but the I/O to the Nagios filesystem is
>> queuing now.
>>
>> We're about to order new hardware for these servers, and I want to
>> make a reasonable choice without requiring too exotic a setup.  I
>> believe these servers are currently Dell 2950s, and they're all
>> running SUSE Linux 10.3 SP2.
>>
>> My first thought was to move the graphs to a NAS share, which would
>> shift that I/O to the network.  I don't know how well that would work,
>> though; it would ultimately be an experiment.
>>
>> What experiences do people out there have handling this kind of I/O,
>> and what have you done to ease it?
>
> You didn't say how many of your checks create perfdata, but I assume
> that most of your disk I/O is related to RRD updates.  rrdcached (see
> http://docs.pnp4nagios.org/pnp-0.6/rrdcached for PNP integration) is a
> good way to collect multiple RRD updates and burst-write the RRD
> files; a minimal setup is sketched below.
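>
> A minimal sketch of the daemon invocation and the PNP hookup -- the
> socket, journal and pidfile paths are examples, not required
> locations:
>
>   # hold updates in memory for ~30 minutes before writing them
>   # back, with a journal so queued updates survive a crash
>   rrdcached -w 1800 -z 1800 -f 3600 \
>       -j /var/rrdcached/journal \
>       -l unix:/var/rrdcached/rrdcached.sock \
>       -p /var/run/rrdcached.pid
>
>   # process_perfdata.cfg -- route PNP's updates through the
>   # daemon instead of writing each RRD file directly
>   RRD_DAEMON_OPTS = unix:/var/rrdcached/rrdcached.sock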
>
> status.dat and the checkresults directory are always good candidates
> to be stored on a ramdisk, especially since they're volatile data.  As
> a side note: status.dat on a ramdisk is a pure boost for the CGIs :).
> I know people who also store nagios.log on a ramdisk and regularly
> save it to hard disk via rsync; see the sketch below.
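>
> For example (mount point, size and numeric uid/gid are illustrative
> only, and the checkresults subdirectory has to be recreated at boot
> since a tmpfs starts out empty):
>
>   # /etc/fstab -- small tmpfs for Nagios' volatile files
>   # (use the numeric uid/gid of your nagios user)
>   tmpfs  /var/nagios/ramdisk  tmpfs  size=256m,mode=0750,uid=106,gid=106  0 0
>
>   # nagios.cfg -- point the volatile files at the ramdisk
>   status_file=/var/nagios/ramdisk/status.dat
>   check_result_path=/var/nagios/ramdisk/checkresults
>   log_file=/var/nagios/ramdisk/nagios.log
>
>   # /etc/cron.d/nagios-logsave -- persist the log every 5 minutes
>   */5 * * * * nagios rsync -a /var/nagios/ramdisk/nagios.log /var/log/nagios/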
>
> My own systems, with ~4,000 checks and ~20,000 performance-relevant
> data sets, went down from 30% to less than 2% I/O wait with rrdcached
> and ramdisk use.
>
> Cheers,
> -Matthias
>
