[opennms-discuss] Which tool is best for me: Nagios, OpenNMS, or something else?

Kelly Jones kelly.terry.jones at gmail.com
Sat Mar 31 20:09:53 CEST 2007


I have ~25 computers worldwide, and want to run several test commands
on them every x minutes (where x is configurable), report the results,
and alert me when "bad things" happen.

Is this what Nagios and OpenNMS are designed for? Specifics are below,
but in short: are they designed only for network testing, or for more
general multi-system status monitoring/reporting/alerting?

More specifically:

% Confirm each machine is up/pingable/reachable [obviously!]

% nmap each machine to make sure correct ports (varies by machine) and
no others are open

% "Egress" testing: machines can/can't reach port 25/80/443/etc (my
choice for each machine) on public machines like www.yahoo.com,
smtp.yahoo.com, etc.

% Config changes: not just hacking/cracking testing, but maybe I
intentionally upgraded something, and accidentally broke email between
two machines or broke Mailman auto-reply or something like that.

% Not all tests all the time: some tests should run less frequently
(to reduce the load); ideally, some "silly" tests run "randomly" to
check on things I'm 100% sure will always work (but which really do
fail sometimes, e.g., a broken loopback interface)

% For machines running httpd, download several pages, diff them against
the last copies of these pages, and report "big" differences (I assume
small diffs are routine changes, but big diffs may be hacking, defacing,
a config error, etc); for "status update" pages (like mrtg), pages
*should* be changing frequently, otherwise something is wrong

% For machines running sendmail, send a test email to one of the other
machines running sendmail, which then confirms receipt; alert if not
received. Also do other mail routing/delivery tests.

% For machines running MySQL, run specific queries that are expected
to return certain results (two basic types of queries: some that
should NEVER return ANY results, others that should ALWAYS return AT
LEAST ONE result)

% For machines running popd/imapd, simulate login to confirm
authentication is working (popd/imapd auth isn't always local for us)

% For machines running bind/tinydns, DNS testing: responding w/
correct IP addresses, and ACL control over who can look up which hosts
(internal machines can look up yahoo.com, external machines can look
up *.mycompany.com only)

% For machines running other daemons, do daemon specific testing/verification

% Backup testing: our backup server should have fairly recent copies
of the files it's backing up from other machines

% Replication testing: confirm that a config file/database/software
version/etc is the same on two given machines

% Monitor files in /etc (eg, passwd, shadow, crontab) for changes.

% Ping the other 24 machines + alert me if ping fails or is very slow

% Firewall rule testing: test which machines can reach which other
machines on what ports and compare to known good list/table

% Cross-report status of each machine to each other machine, even if
nothing is wrong (so each machine knows how the other 24 are doing)

% Run things like "df -k/df -ik", "ps -aux -www", "top -n -d 1
infinity", "netstat -a", "vmstat", "mailq -v", "uptime", etc, and find
memory-hogging/CPU-intensive processes, non-daemon processes that have
been running for a long time, processes running w/o proper
subprocesses, non-listening daemon processes (like ntpd) not running,
near-full disk partitions, DoS attacks, many sockets in FIN_WAIT_[12]
state, overfull mail queues, recent reboots, etc.

% Allow for special cases: run specific tests (eg, a Perl script) on
only 1-2 of the 25 machines

% Windows machines: ideally, run the equivalent of the commands above
and also report failed scheduled tasks, near-full Exchange stores, and
other Windows-specific issues

% Ideally, a lot of the above tests should run "out of the box" or
Nagios/OpenNMS could run some sort of "discovery" program (find out
which machines are running httpd and grab a few pages linked off the
home page + use those as the "test" pages from then on), and allow me
to customize as necessary. Of course, I realize I'll have to configure
things like custom SQL queries myself.

% Ideally, the testing should be "decentralized": any of the machines
can test any of the others, and the results are stored in a
distributed/mirrored way. However, the testing management is ideally
"centralized" in the sense that I can control testing on all machines
from a given machine.

% Ideally, the results (good or bad) can be displayed in a web page so
my customers can see that my machines are being tested regularly, and
are up and running fine as of x minutes ago.

% Ideally, the "something bad has happened" reporting can be
configured: it may be OK for "mailq -v" to be large for 10-15
minutes, but not for 30 minutes (for example).

% Ideally, software-specific "regression" testing. E.g., when I upgrade
Mailman, sendmail, etc., Nagios/OpenNMS could run a set of tests to
make sure I didn't break something horribly.
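
To make three of the items above concrete, here is a rough Python sketch
of the logic I have in mind: the "big diff" page check, the two kinds of
MySQL query expectation, and the "bad for 30 minutes" alert rule. It is
only an illustration and says nothing about how Nagios or OpenNMS work
internally. The names (is_big_change, query_check_ok, SustainedCondition),
the 0.7 similarity cutoff, and the 30-minute grace period are all made up
for this example.

```python
import difflib
import time


def is_big_change(old_page, new_page, min_similarity=0.7):
    """Return True when two page snapshots differ by more than a small edit.

    Pages are compared line by line; min_similarity is an assumed cutoff.
    """
    matcher = difflib.SequenceMatcher(
        None, old_page.splitlines(), new_page.splitlines(), autojunk=False
    )
    return matcher.ratio() < min_similarity


def query_check_ok(rows, expectation):
    """Check a query result against one of the two expectation types.

    'never_any'    -> the query must return no rows.
    'at_least_one' -> the query must return one or more rows.
    """
    if expectation == "never_any":
        return len(rows) == 0
    if expectation == "at_least_one":
        return len(rows) >= 1
    raise ValueError("unknown expectation: %r" % (expectation,))


class SustainedCondition:
    """Alert only once a bad reading has persisted for grace_seconds."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.bad_since = None  # time of the first bad reading in a row

    def update(self, is_bad, now=None):
        """Record one reading; return True if an alert should fire."""
        if now is None:
            now = time.time()
        if not is_bad:
            self.bad_since = None  # condition cleared, start over
            return False
        if self.bad_since is None:
            self.bad_since = now
        return now - self.bad_since >= self.grace_seconds


# Hypothetical usage: tolerate a large mail queue for up to 30 minutes.
mail_queue_alert = SustainedCondition(grace_seconds=30 * 60)
```

Each run of a test would call update() with the latest reading, so a
large queue at the 15-minute mark stays quiet, while one still large at
the 30-minute mark raises the alert.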

-- 
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.
