dynamic checks

Joel Brooks jbrooks at oddelement.com
Fri Jul 16 17:08:22 CEST 2010


Hi all,

This is a long post, but (I think) it's an interesting problem...

I'm struggling a little with a check procedure I'm trying to create.  I'm a
long-time Nagios user and I *know* there's a way to do this, but I'm having
trouble wrapping my head around the best way to achieve the desired effect.

I have a data center that I monitor with Nagios.  It has about 40 database
servers hosting databases for about 500 customers.

I have a number of application checks that I need to run against these
databases - things like a data import queue, number of active threads, etc.
There are about 10 checks to perform for each customer database.  In other
words, that's about 10 checks x 500 customer databases, or roughly 5,000
checks in total.

The first problem is that I need to move a customer's database from one
server to another from time to time for various reasons (capacity /
performance, etc).  This means all the checks for that customer database
have to move from one database host to another.

What I've done so far is to create dynamic service checks for each of these
application counters.  The checks do the following:

1. Execute a query against the current host's master database and retrieve
   a list of customer database instances on this host.
2. For each database, query the relevant application counter.
3. If there are any problems (warn or crit thresholds surpassed), return
   warn or crit and list only the databases that are in trouble.

e.g.: WARN: at least one database is in trouble.  \n  Customer1: import
queue is > 500.
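
The aggregate check described above could be sketched roughly as follows.
This is a minimal sketch, not the actual script: the function name
check_queues, the "database value" input format on stdin, and the message
text are assumptions for illustration.  In practice the values would come
from the query against the host's master database.

```shell
#!/usr/bin/env bash
# Hypothetical aggregate check: reads "database value" pairs on stdin,
# compares each value with the warn/crit thresholds, and lists only the
# databases that are in trouble. Perfdata is emitted for every database.

check_queues() {
    local warn=$1 crit=$2
    local status=0 summary problems="" perfdata="" db value

    while read -r db value || [ -n "$db" ]; do
        if [ -z "$db" ]; then continue; fi
        perfdata="$perfdata $db=$value;$warn;$crit"
        if [ "$value" -gt "$crit" ]; then
            status=2
            problems="$problems  $db: import queue is > $crit.\n"
        elif [ "$value" -gt "$warn" ]; then
            if [ "$status" -lt 1 ]; then status=1; fi
            problems="$problems  $db: import queue is > $warn.\n"
        fi
    done

    case $status in
        0) summary="OK: all databases are healthy." ;;
        1) summary="WARN: at least one database is in trouble." ;;
        2) summary="CRIT: at least one database is in trouble." ;;
    esac

    # First line: summary plus perfdata; following lines: the details.
    printf '%s |%s\n' "$summary" "${perfdata# }"
    printf '%b' "$problems"
    return "$status"
}
```

For example, printf 'Customer1 620\nCustomer2 12\n' | check_queues 500 1000
exits with status 1 (WARNING) and names only Customer1 in the detail lines,
while the perfdata still covers both databases.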

OK, so far so good.  When I move a customer database, the check on the old
server just doesn't get that customer's database in the list anymore and the
check on the new server begins checking it.  Great.  First problem solved.

Now the tricky part.  I'm using PNP to graph performance data.  The check
script described above returns a LONG perfdata string with perfdata for each
database.  The way PNP works, it creates one big RRD file for each check -
in other words, one RRD file with a data source for each customer database
on that server.  When a customer database moves to a new database server,
the RRD file is not - cannot be - updated, so the perfdata just stops for
that customer database.  It is not easy to move a data source from one RRD
file to another, so I have a conundrum.

One way to fix this is to simply create a check in Nagios for each customer
database.  If a database moves, just delete the check on the old server and
create it on the new server.  Move the relevant RRD file to its new home
under the new server's PNP directory and we're done.  But that means
maintaining 5,000 or so check commands on the Nagios server and all the
associated overhead of running so many checks.  The way I have it now,
there are only 10 x 40 (400) checks - which is much more manageable.
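
With one check per customer database, the "move the RRD file" step could
look something like the sketch below.  It is only a sketch under
assumptions: move_rrd is a made-up helper name, and the layout
<perfdata_dir>/<host>/<service>.rrd with a matching .xml file is assumed
rather than confirmed for this installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move one service's PNP files from the old host's
# perfdata directory to the new host's directory.
# Assumed layout: <perfdata_dir>/<host>/<service>.rrd and <service>.xml

move_rrd() {
    local perfdata_dir=$1 old_host=$2 new_host=$3 service=$4
    local ext src

    mkdir -p "$perfdata_dir/$new_host" || return 1
    for ext in rrd xml; do
        src="$perfdata_dir/$old_host/$service.$ext"
        # Skip files that do not exist (e.g. no .xml written yet).
        if [ ! -e "$src" ]; then continue; fi
        # -n: never overwrite data already collected under the new host.
        mv -n "$src" "$perfdata_dir/$new_host/" || return 1
    done
    return 0
}
```

Usage would be along the lines of
move_rrd /path/to/perfdata db01 db02 import_queue_customer1, run after the
service definition has been switched to the new host.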

I looked at using check_multi, but it suffers the same problem - the
perfdata is returned for all child checks in one perfdata string.

What I need is a way to dynamically build service checks for each database
server.  I'm thinking about a check command that does something like:

# run the application check once per customer database on this host
for db in $(cat db-server-host-name.txt); do
  check_nrpe -H db-server-host-name -c check_app -a "$db"
done

but I'm not sure how this would work in terms of nagios service check
definitions.  One possibility is to use another script outside of nagios
that does something like:

in nagios.cfg:
cfg_file=/etc/nagios/database-checks.cfg

for server in $(cat db-servers.txt); do
  # query master database on $server, get a list of customer databases
  # dump list to $server.txt (i.e. db-server-host-name.txt)
  :
done

for server in $(cat db-servers.txt); do
  for db in $(cat "$server.txt"); do
    # modify database-checks.cfg:
    # create service definitions for $db on $server
    :
  done
done
service nagios reload

or something like that.
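
To make the "create service definitions" step more concrete, here is a
rough sketch of a generator.  Everything named here is an assumption for
illustration: the gen_services function, the "<db-host> <database>" input
format, the generic-service template, the import_queue_ naming scheme, and
a check_nrpe command definition that passes two arguments (the NRPE command
name and the database) through to the remote host.

```shell
#!/usr/bin/env bash
# Sketch: emit one Nagios service definition per (host, database) pair.
# Input on stdin: "<db-host> <database>", one pair per line.

gen_services() {
    local host db
    while read -r host db || [ -n "$host" ]; do
        # Ignore blank or incomplete lines.
        if [ -z "$host" ] || [ -z "$db" ]; then continue; fi
        cat <<EOF
define service {
    use                  generic-service
    host_name            $host
    service_description  import_queue_$db
    check_command        check_nrpe!check_app!$db
}

EOF
    done
}
```

The output could be written to a temporary file, checked with
nagios -v /etc/nagios/nagios.cfg, and only then copied over
database-checks.cfg and followed by the reload, so a bad generator run does
not take monitoring down.  Passing the database name as an NRPE argument
also requires argument support to be enabled on the NRPE side.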

I'm open to suggestions if anyone has a better way to do this... maybe I'm
over-complicating this - I have been buried in this conundrum for a few days
and may not be seeing the forest for the trees anymore... :/

Thank you all!

J