What kind of checks/minute numbers are you getting for single host / non-distributed setups?

Max perldork at webwizarddesign.com
Sat Aug 29 17:32:04 CEST 2009

Previous message: What kind of checks/minute numbers are you getting for single host / non-distributed setups?
Next message: Occasionally "Return code of 127 for check of host/service was out of bounds. Make sure the plugin you're trying to run actually exists."

Hi Ryan,

On Fri, Aug 28, 2009 at 11:31 PM, Ryan Bowlby<rbowlby83 at yahoo.com> wrote:
> Those are impressive numbers for a single Nagios instance. You may be able to tweak out some additional time but you leave the Nagios daemon little room for leeway. What I mean is if two dozen hosts start reporting critical and Nagios starts performing checks at the more aggressive retry_check_interval instead of the normal check_interval, then your check latency is going to go through the roof.

Yes, we have had to balance that already, and good point. Right now
the majority of our service checks use a 5 normal / 3 retry interval
with 2 retries, which has been fine for our users.
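
To put rough numbers on the normal-versus-retry trade-off, here is a
minimal sketch. It assumes the intervals are in minutes (the Nagios
default interval_length of 60 seconds), and the service counts are made
up for illustration, not taken from our installation.

```python
# Rough sketch: average check rate when some services move from the
# normal interval to the (shorter) retry interval.
# Assumption: intervals are in minutes (Nagios default interval_length = 60).


def checks_per_minute(services, interval_minutes):
    """Average checks per minute for services polled every interval_minutes."""
    return services / interval_minutes


def total_rate(total_services, failing_services, normal=5, retry=3):
    """Combined rate with failing_services on the retry interval."""
    healthy = total_services - failing_services
    return (checks_per_minute(healthy, normal)
            + checks_per_minute(failing_services, retry))


if __name__ == "__main__":
    # 3000 services is a hypothetical figure, used only for the arithmetic.
    print(total_rate(3000, 0))    # everything healthy
    print(total_rate(3000, 600))  # 20% of services in retry
```

The extra load only lasts while services are still in a soft state, so
with 2 retries it is a short burst rather than a permanent increase.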

> That being said here are some ideas that you may already be trying, but if not may buy you some time.
> - switch from check_ping to check_icmp as it's 9x faster in some instances.

We did that, and yes, definitely a noticeable improvement there.

> - If any of the client-side nrpe checks are perl, python, etc you may see a decrease in check-time by compiling them. Same for the Nagios server if you aren't already (built-in perl, etc).

All of our checks are either C plugins or ePN-based perl checks :).

> - Often NRPE checks such as those monitoring hardware don't need to be performed as often as say a check_tcp, but since people use templates NRPE frequently gets configured with the same aggressive check_interval as other checks. Scaling back on these will greatly increase the number of checks the server can do.

Our retry is pretty conservative (60% of normal). Good point, though:
we should see if anyone has configured their checks to have a more
aggressive retry rate (we work on a self-service model where users of
our system can modify their own configs).
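
An audit like that could be sketched roughly as follows. This is an
illustration only: the parser handles plain "define service" blocks,
does not resolve template inheritance, and the 0.6 threshold and the
sample definitions are assumptions chosen to match the 60% figure
above.

```python
import re

# Matches the body of each "define service { ... }" block.
SERVICE_BLOCK = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_services(text):
    """Return one dict of directives per service definition in text."""
    services = []
    for body in SERVICE_BLOCK.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        services.append(directives)
    return services


def aggressive_retries(services, min_ratio=0.6):
    """Names of services whose retry/normal interval ratio is below min_ratio."""
    flagged = []
    for svc in services:
        # Accept both the newer and the older directive names.
        normal = svc.get("check_interval") or svc.get("normal_check_interval")
        retry = svc.get("retry_interval") or svc.get("retry_check_interval")
        if normal is None or retry is None:
            continue  # probably inherited from a template; not resolved here
        if float(normal) <= 0:
            continue
        if float(retry) / float(normal) < min_ratio:
            flagged.append(svc.get("service_description", "<unnamed>"))
    return flagged


# Hypothetical sample config, for illustration only.
SAMPLE = """
define service {
    service_description  HTTP
    check_interval       5
    retry_interval       3
}
define service {
    service_description  Disk
    check_interval       5
    retry_interval       1   ; very aggressive
}
"""

if __name__ == "__main__":
    print(aggressive_retries(parse_services(SAMPLE)))
```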

> At my work we have 4 remote Nagios instances performing approximately 9400+ checks to our Central Nagios server via nsca. This leaves room for a 400% increase in checks as more departments begin utilizing the monitoring system. Our configs are built by a custom script from our custom dbase and pushed out to the servers via a custom script that keeps everything in cvs. It all works great but took forever to configure. If I had to do it again I would take a serious look at two other options:

Very nice.

> http://dnx.sourceforge.net/ - Crap ton of checks on ONE Nagios instance!

We are going to try this first for our poller tier and keep
notifications, trending, and trap persistence on a separate tier.
Our only concern here is that the remote pollers will not have ePN,
and I know how much CPU and load will increase running thousands of
checks a minute in perl without ePN.

We have been playing (at the whiteboard level) with the idea of a
persistent script execution proxy/daemon that would allow us to
develop checks in very high-level languages. The checks would be
called through a small check command, similar to check_nrpe, that
only knows how to talk to the persistent script daemon over a socket.
For example, we could write something based on a JVM that embedded
JRuby and Jython as well, allowing developers to write checks in
Java, Ruby, or Python. This would also eliminate the problem of
always having to do full restarts of Nagios when ePN is on. We
realize this would not be an easy thing to implement, but it sounds
theoretically very attractive as a way to get decent check
performance and move the persistent language daemon outside of
Nagios.
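
To make the idea concrete, here is a minimal sketch of that shape in
Python rather than on a JVM. Everything in it is an assumption made up
for illustration: the one-line text protocol, the check names, and the
function names. It is not NRPE and not a design we have built.

```python
import socket
import socketserver
import threading

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_always_ok(args):
    """Hypothetical check; a real one would probe something."""
    return OK, "OK - nothing to see here"


def check_threshold(args):
    """Hypothetical check: CRITICAL if the first argument exceeds 90."""
    value = float(args[0])
    if value > 90:
        return CRITICAL, f"CRITICAL - value {value}"
    return OK, f"OK - value {value}"


# Checks are loaded once and stay resident, so no interpreter start-up
# cost is paid per check.
CHECKS = {"always_ok": check_always_ok, "threshold": check_threshold}


class CheckHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Request: one line, "<check_name> [arg ...]".
        words = self.rfile.readline().decode().split()
        try:
            code, text = CHECKS[words[0]](words[1:])
        except Exception as exc:  # unknown check, bad arguments, check bug
            code, text = UNKNOWN, f"UNKNOWN - {exc!r}"
        # Response: one line, "<exit_code> <plugin output>".
        self.wfile.write(f"{code} {text}\n".encode())


def start_daemon(host="127.0.0.1", port=0):
    """Start the daemon in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), CheckHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_check(address, name, *args):
    """Thin client: what the small check command would do."""
    with socket.create_connection(address, timeout=10) as conn:
        conn.sendall((" ".join([name, *map(str, args)]) + "\n").encode())
        reply = conn.makefile().readline()
    code, _, text = reply.rstrip("\n").partition(" ")
    return int(code), text


if __name__ == "__main__":
    server = start_daemon()
    print(run_check(server.server_address, "threshold", 95))
    server.shutdown()
    server.server_close()
```

The real check command would print the returned text and exit with the
returned code, so Nagios would see it as an ordinary plugin.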

> http://www.opsview.org/ - Multiple Nagios instances without writing a slew of custom scripts to do it!

That is definitely well known for performing well and for its stability :).

Thanks a lot for responding, Ryan, appreciate it!

- Max

