check_cluster limits

Randy Herban rherban at purdue.edu
Wed Nov 26 00:41:47 CET 2008


We're trying to set up cluster monitoring with Nagios, but I think we've
hit a snag.  Our biggest cluster is 890 machines, and Nagios is
chopping the list at ~316, probably due to a character limit on
check_cluster arguments.

We wrote another script to divide the cluster nodes into smaller
batches, call check_cluster multiple times, and tally the results.
However, check_cluster only sums the inputs passed from Nagios; it
doesn't check the cache itself.  It looks like an older version could
be pointed at status.dat and count directly, but the newest cannot.

At the moment, the most promising path looks like writing a script to
parse status.dat and count states by hand.  It's not pretty, but it'll
work.  Is there something else I'm missing that might be easier?
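
The status.dat approach could look roughly like the sketch below.  It
is a minimal Python sketch, not a tested plugin.  It assumes the
Nagios 3.x status.dat layout, where each host is a "hoststatus { ... }"
block of key=value lines containing host_name and current_state (0
meaning UP).  The function names, the host set, and the warn/crit
thresholds are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: blocks open with "<type> {" and close with "}" on its own line,
# as in Nagios 3.x status.dat. Older versions may name the blocks differently.
BLOCK_RE = re.compile(r"^(\w+)\s*\{$")


def parse_status(text):
    """Yield (block_type, fields) for every block found in status.dat text."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            match = BLOCK_RE.match(line)
            if match:
                block_type, fields = match.group(1), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def count_host_states(text, wanted_hosts):
    """Count current_state values for the hosts that belong to one cluster."""
    counts = Counter()
    for block_type, fields in parse_status(text):
        if block_type != "hoststatus":
            continue
        if fields.get("host_name") not in wanted_hosts:
            continue
        state = fields.get("current_state", "")
        if state.isdigit():
            counts[int(state)] += 1
    return counts


def cluster_exit_code(counts, total_hosts, warn, crit):
    """Map the number of non-UP hosts to a plugin exit code (0, 1 or 2).

    Hosts missing from the file are treated as not UP.
    """
    not_up = total_hosts - counts[0]
    if not_up >= crit:
        return 2
    if not_up >= warn:
        return 1
    return 0
```

A wrapper would read the status file (its path is site-specific), call
count_host_states() with the set of node names for one cluster, and
exit with the code from cluster_exit_code().  Reading the file directly
avoids passing hundreds of states on the command line, at the cost of
depending on the file format.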


For reference, we maintain research computing clusters; several
hundred nodes per cluster is common in our environment.  I'm trying
to monitor each cluster and start pinging our students at certain
thresholds, escalating up to paging admins.


Thanks
-Randy
