Planning code for coping with a partial mesh network.

Stephen Schaefer SSchaefer at rfmd.com
Tue Mar 23 18:45:39 CET 2004


The current Nagios code does not work with a partial mesh network: the
routine that is supposed to detect and reject circular paths recurses
infinitely, crashing when it runs out of stack.  I've tried removing
that checking routine, and Nagios monitoring runs fine, although I
haven't yet observed what happens when a router fails: the network I'm
monitoring is production, and I haven't had time to build a virtual
network to test with (though it seems one could do a lot with UML...).
However, the network representation (Status Map) CGIs crash.

The code to fix those is pretty straightforward: the graph is
connected, so you just walk it as you would a tree, but check whether
you've previously visited each node as you come to it, and simply don't
explore it twice.  That means keeping track of which nodes you've
already visited.  I prefer not to add a mark slot to the node data
structures (which can be hosts, or (host, service) pairs), since there
may now, or in the future, be multiple graph walks in progress
concurrently.  Instead, each graph traversal should keep its own mark
set.  Each node has a unique identifier, so the obvious data structure
to keep these in is a hash.

Now, I've seen hashes go into the Nagios code - and come out of it.  In
particular, there was the performance enhancement hash from Daniel
Drown.  In my light reading of the code, it used a couple of global
hashes that were manipulated in a variety of places, and I'm guessing,
since there was no explanation from Mr. Galstad for their removal, that
the mechanism was subject to data corruption.

So, what kind of hashes should I use?  My first thought was to use the
old ndbm routines with NULL as the file name.  I'm building on Red Hat
7.3, and I find that the access functions don't match the manual pages.
I can read the include files, but I'm in a bad mood when I have to do
it.

Recently on this list, hashes from glib were proposed to solve
performance issues similar to those addressed by Mr. Drown.  I
understand that Mr. Galstad may be reluctant to create a significant
dependency on another external library, but is the alternative -
maintaining your own code - really preferable?  Is there a more
appropriate library from which to take hashes - something that wouldn't
taint neutrality in the GNOME/KDE jihads?  And do we care?  Remember
licensing constraints.

Management has not given this project high priority, so if you're
thinking of doing anything similar, don't wait for me.  Nonetheless, if
I do complete the work, I'd like it to benefit the community, so let me
know now what constraints I should consider such that the result might
be accepted.

    - Stephen P. Schaefer





