New Nagios implementation proposal

nap naparuba at gmail.com
Tue Dec 1 18:09:56 CET 2009


Hi list,

I would like your feedback on an (unfinished) reimplementation of
Nagios named "Shinken". I wrote it in Python, and it is faster and
more modular than the current Nagios implementation in C (yes, faster,
you read that correctly; I was the first to be surprised by it).

== Shinken's history ==
A few months ago, I started working on a proof of concept for Nagios
focused on distributed environments and performance. The main goal was
to look for a distributed, high-availability architecture. I also
thought that Nagios' performance was quite good, but that we could get
more.

For quick testing and development, I used Python. I thought a process
pool could make Nagios quicker than forking a new process for each
check only to kill it a few seconds later. I also bypass Nagios'
reaping mechanism: reading flat files is just too slow. Instead, the
results are a structure that is sent directly to the scheduler. No
files, more performance. To be on par with Nagios, I added the same
monitoring logic to the scheduler: HARD/SOFT states, dependencies
(parents, servicedep, hostdep, etc.) and database export (Merlin).
Shinken uses the standard Nagios configuration files.
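
To make the process pool idea concrete, here is a minimal sketch in
modern Python 3. It is not Shinken's code: the names run_check and
run_checks are made up for this illustration. Long-lived workers run
the check commands and hand the results back as plain structures in
memory, with no result files to reap.

```python
import subprocess
import sys
from multiprocessing import Pool


def run_check(command):
    """Run one check command and return its result as a plain dict."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "exit_status": proc.returncode,
        "output": proc.stdout.strip(),
    }


def run_checks(commands, workers=4):
    """Run all commands on a pool of long-lived worker processes.

    The results come back to the caller directly, as a list of dicts.
    """
    with Pool(processes=workers) as pool:
        return pool.map(run_check, commands)


if __name__ == "__main__":
    # A tiny "echo + exit" style check, written with the Python
    # interpreter itself so that the example is portable.
    check = [sys.executable, "-c", "print('OK')"]
    for result in run_checks([check] * 8):
        print(result["exit_status"], result["output"])
```

The workers still start one process for the plugin itself; what the
pool saves is creating and destroying a worker for every single check.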

And the performance is quite good: with Nagios 3, a small check (an
echo + exit) and a mid-range server, I run 10000 checks in 5 minutes
(latency near 1s), 30K with full tweaks. With my tool, I run 150K!!


== The global architecture ==
For the architecture, I think we must follow the Unix way of doing
things: one tool per task. For now, Nagios does nearly everything: it
reads the conf, schedules, launches checks and raises notifications. I
am trying an architecture where the administrator can have as many
hosts/services as he wants and the daemons are just resources to
manage them. The architecture I propose is the following:
*Arbiter: a daemon that reads the configuration and cuts it
automatically into N confs (keeping relations like parents in the same
conf), where N is the number of schedulers we have. It dispatches the
confs, and also reads the orders in nagios.cmd and dispatches them to
the schedulers.
*Schedulers: do the scheduling by looking at the states of
hosts/services. They just build queues of checks/notifications/event
handlers for the other daemons. Same thing for event broker
information: it's just a queue.
*pollers: use a process pool, get the checks to launch from the
schedulers and return the results to the schedulers.
*reactionners: same as pollers, but for notifications and event
handlers.
*brokers: get event broker information from the schedulers and "do
things" with it (like creating the service-perfdata file, or filling
databases).
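
As an illustration of the Arbiter's cutting step, here is a small
sketch in Python 3. It is only one possible way to do it, not
necessarily what Shinken does: the function name split_configuration
and the greedy balancing are assumptions of this example. Hosts linked
by a parent relation are grouped into packs, and the packs are then
spread over the N schedulers.

```python
from collections import defaultdict


def split_configuration(parents, scheduler_count):
    """Cut hosts into one part per scheduler, keeping related hosts together.

    `parents` maps each host name to the list of its parent host names.
    """
    # Union-find: hosts linked by a parent relation share the same root.
    root = {host: host for host in parents}

    def find(host):
        while root[host] != host:
            root[host] = root[root[host]]  # path halving
            host = root[host]
        return host

    for host, host_parents in parents.items():
        for parent in host_parents:
            root.setdefault(parent, parent)
            root[find(host)] = find(parent)

    # One pack per group of related hosts.
    packs = defaultdict(list)
    for host in sorted(root):
        packs[find(host)].append(host)

    # Greedy balancing: biggest packs first, each into the smallest part.
    parts = [[] for _ in range(scheduler_count)]
    for pack in sorted(packs.values(), key=len, reverse=True):
        min(parts, key=len).extend(pack)
    return parts


if __name__ == "__main__":
    example = {
        "router": [],
        "web1": ["router"],
        "web2": ["router"],
        "db1": [],
        "backup": ["db1"],
        "lonely": [],
    }
    for number, part in enumerate(split_configuration(example, 2)):
        print("scheduler", number, part)
```

A real splitter would also have to follow host and service
dependencies, and could balance on the number of services instead of
the number of hosts.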

The pollers work like DNX, nothing new here. The reactionners allow
the administrator to have a single daemon that sends all the
notifications of all his schedulers (useful for SMTP authorizations,
or for filling a single RSS file with all notifications). The
schedulers do not launch checks, notifications or event handlers
themselves, so launching them adds no latency to the scheduling.

The load balancing is automatic: the Arbiter cuts the conf and
dispatches the parts. For high availability, there can be spare
daemons: if a daemon dies, another one takes over its configuration
(the Arbiter "pings" the daemons, and if a daemon fails, it just sends
the configuration to a spare). The daemons are reached over the
network, so all daemons can be on different servers (and it's better
for high availability not to put all daemons on the same server :) ).
For now, the Arbiter does not have a spare, but one will be added in
the future.
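
The spare mechanism can be sketched like this in Python 3. This is a
simplified, hypothetical model: the function dispatch_with_spares and
the is_alive callable (which stands in for the network "ping") are
inventions of this example, not Shinken's API.

```python
def dispatch_with_spares(assignments, spares, is_alive):
    """Do one round of "pings" and move confs from dead daemons to spares.

    `assignments` maps a daemon name to the conf it manages.
    Returns (new_assignments, orphaned_confs).
    """
    free_spares = [spare for spare in spares if is_alive(spare)]
    new_assignments = {}
    orphaned = []
    for daemon, conf in assignments.items():
        if is_alive(daemon):
            new_assignments[daemon] = conf
        elif free_spares:
            # The daemon failed: a spare takes over its configuration.
            new_assignments[free_spares.pop(0)] = conf
        else:
            # No spare left: this conf is not managed for now.
            orphaned.append(conf)
    return new_assignments, orphaned


if __name__ == "__main__":
    dead = {"scheduler-2"}
    current = {"scheduler-1": "conf-0", "scheduler-2": "conf-1"}
    print(dispatch_with_spares(
        current, ["scheduler-spare"], lambda name: name not in dead))
```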

You can see this Architecture in the file shinken-architecture.png.

If the user configuration does not define such daemons, Shinken
automatically creates default ones (on localhost with default ports).

== Advanced architecture ==
In the architecture we just saw, all reactionners/pollers/brokers take
orders from ALL schedulers. This can be a problem with reactionners
(with 3 SMTP servers (USA, Europe, Asia), it's hard to force Asia
notifications to go through the Asia SMTP server). Same for pollers: a
poller asks the schedulers for checks to run, and getting checks from
a very distant scheduler can be very slow.
To manage this, Shinken uses a way of cutting up the architecture:
Realms.

A realm is a pool of daemons that work together. A host is tagged with
a realm (and only one), so it will be managed by this realm's
schedulers/pollers/reactionners/brokers. A realm can have sub-realms,
so you can put a reactionner in the higher realm and it will manage
all the schedulers of the sub-realms. A picture is worth a thousand
words: you can get a better idea of what a realm is from the file
shinken-architecture-global-realm.png.

Same as for daemons: if the user configuration does not define a
realm, a default one is created by Shinken.
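
The realm and sub-realm rule described above can be modelled in a few
lines of Python 3. The class Realm and the scheduler names below are
hypothetical, chosen only for this sketch: a daemon placed in a realm
serves the schedulers of that realm and of all its sub-realms.

```python
class Realm:
    """A pool of daemons, possibly containing sub-realms."""

    def __init__(self, name, parent=None):
        self.name = name
        self.sub_realms = []
        self.schedulers = []
        if parent is not None:
            parent.sub_realms.append(self)

    def managed_schedulers(self):
        """Schedulers served by a daemon placed in this realm:
        the realm's own ones plus those of every sub-realm."""
        found = list(self.schedulers)
        for sub in self.sub_realms:
            found.extend(sub.managed_schedulers())
        return found


if __name__ == "__main__":
    world = Realm("World")
    europe = Realm("Europe", parent=world)
    asia = Realm("Asia", parent=world)
    europe.schedulers.append("scheduler-europe")
    asia.schedulers.append("scheduler-asia")

    # A reactionner in World handles both schedulers,
    # a reactionner in Asia only handles the Asia one.
    print(world.managed_schedulers())
    print(asia.managed_schedulers())
```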

== What is not managed by Shinken ? ==
A lot of stuff! But the most important are regexp configurations,
inherits_parents for host/service dependencies (always 1 in Shinken)
and notification escalations. It also does not support excluded
timeperiods (like Nagios, in fact ;) )

The current implementation doc is at
http://wiki.nagios-fr.org/nagios/shinken/start, in French. I am
writing the English documentation, and English will be its primary
language in the future.

== What is managed ? ==
All the classic stuff is managed (SOFT/HARD, complex inheritance,
volatile services, freshness, timeperiods without exclude, flapping
states...). It also has NDO and Merlin database support with MySQL,
and NDO support with Oracle (yes, like Icinga)!! The NDO support is
not complete: some objects are not managed (like notifications), but
they are not difficult to add. It also supports UTF-8 names.


== How do I test this freaking tool ? ==
Just get the VirtualBox VM at http://www.megaupload.com/?d=57BGSL09
(yes, there can be a legal file on megaupload :) ). It's in OVF format
so you need to import it into VirtualBox.
It's an Ubuntu server with a DHCP NIC; the account is shinken/shinken.
You can launch all daemons with:
./launch_all.sh
and kill them all with:
./stop_all.sh

Look at the small README file to see how to watch the output of the
daemons (tail -f debug files). The current configuration is quite
small (1500 services) so it will run with no problem. You have a Ninja
interface at http://IP_OF_THE_VM/ninja with monitor/monitor to watch
it work.
Warning: Ninja does not seem to see more than one instance_id in the
database, so you will see only half of the hosts/services. You can
remove one of the schedulers in etc/conf.cfg: all hosts will be added
automatically to the last active scheduler :)

You can test your current Nagios conf with Shinken; it will create the
daemons' configuration if needed.


If you want to install it from scratch, it's not so difficult.
Shinken just needs:
*python-2.6
*pyro (a Python module like Corba)
*python-graph-core (on Ubuntu: sudo apt-get install python-setuptools
&& sudo easy_install python-graph-core). I will drop this dependency
soon (I only use it for a loop check, so a whole module for that is
just too much...)
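
To show why a dedicated module is more than such a loop check needs,
here is a minimal sketch in plain Python 3 that looks for a loop in
the parent relations with a depth-first search. The function name
find_parent_loop is made up for this example, and this is not the
check Shinken actually ships.

```python
def find_parent_loop(parents):
    """Return one loop as a list of host names, or None if there is none.

    `parents` maps each host name to the list of its parent host names.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # not visited / being visited / done
    state = {}
    path = []

    def visit(host):
        state[host] = GREY
        path.append(host)
        for parent in parents.get(host, []):
            parent_state = state.get(parent, WHITE)
            if parent_state == GREY:
                # We came back to a host of the current path: a loop.
                return path[path.index(parent):] + [parent]
            if parent_state == WHITE:
                loop = visit(parent)
                if loop:
                    return loop
        path.pop()
        state[host] = BLACK
        return None

    for host in parents:
        if state.get(host, WHITE) == WHITE:
            loop = visit(host)
            if loop:
                return loop
    return None


if __name__ == "__main__":
    print(find_parent_loop({"a": ["b"], "b": ["c"], "c": ["a"]}))
    print(find_parent_loop({"web": ["router"], "router": []}))
```

The recursion follows the parent chains, so very long chains would
need an iterative version.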

You can get the code with :
git clone git://shinken.git.sourceforge.net/gitroot/shinken/shinken

Remember to change etc/nagios.cfg and etc/conf.cfg to use your
directory and, optionally, to edit the "plugin" object in conf.cfg to
set your NDO or Merlin database user/pass/database. Then you just need
to launch, from shinken/src (here in 5 shells, not daemonized, for
easy testing):
python shinken-scheduler.py
python shinken-poller.py
python shinken-reactionner.py
python shinken-broker.py
python shinken-arbiter.py -c etc/nagios.cfg

== And now ? ==
The proof of concept has become a new implementation: it's now easier
to add the missing Nagios features to Shinken than to port Shinken's
features to the current Nagios.

I tried to talk about this new implementation directly with some
people on this list, but they did not seem very keen on it. I can
easily understand: the process pool alone is hard work in C (and we
cannot take the Apache code for it, not the right licence :( ) and it
would change a lot of Nagios internals. Replacing the reaping process
with a socket is quite hard too.

Yes, it breaks nearly everything, I know. It's not binary compatible
with the event broker modules (Merlin, NDO, Livestatus), but I think
Nagios must evolve quicker than it currently does. Zenoss's evolution
is very impressive. The current Nagios implementation in C is good (it
has done the job for the last 10 years!!). But just as the old CGI
interface was dropped for PHP (Ninja in fact, because the new Nagios
XI interface is just not open source at all), we must keep all the
ideas of what Nagios is (hosts, services, configuration with
inheritance, timeperiods) and put them in a new tool written in a
high-level language.

I think C is not always the right language for tools. If we are afraid
of making a new architecture just because managing sockets/IPC is too
hard, we must change the language.
If the idea of replacing the old fork/fork/reaper way with a new one
based on a process pool and results returned directly in memory gives
you nightmares, we must change the language.
If the idea of Zenoss becoming the new reference among OSS monitoring
tools gives you even worse nightmares: we must evolve quicker, so we
must change the language.

An example: to add a new property to a Nagios object in the current C
code, we must add it in numerous files (config file reading, object
creation and so on). With a higher-level language like Python, it just
needs ONE line and everything after that is handled (inheritance,
object creation, default value, conversion from string to a real value
like an int or a list of values).
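
Here is a rough sketch of what such a one-line property declaration
can look like in Python 3. It is an illustration only, not Shinken's
actual classes: the Host class, the to_list helper and the default
values are assumptions made for this example.

```python
def to_list(value):
    """Convert "a, b,c" into ["a", "b", "c"]."""
    return [item.strip() for item in value.split(",") if item.strip()]


class Host:
    # property name -> (default as a config string, converter)
    properties = {
        "host_name": ("", str),
        "max_check_attempts": ("3", int),
        "parents": ("", to_list),
        # Adding a new property is this ONE line:
        "check_interval": ("5", int),
    }

    def __init__(self, raw, template=None):
        """Build a host from raw config strings, with optional inheritance."""
        for name, (default, convert) in self.properties.items():
            if name in raw:
                value = convert(raw[name])        # set explicitly
            elif template is not None:
                value = getattr(template, name)   # inherited
            else:
                value = convert(default)          # default value
            setattr(self, name, value)


if __name__ == "__main__":
    generic = Host({"max_check_attempts": "5"})
    web = Host({"host_name": "web1", "parents": "router, switch"},
               template=generic)
    print(web.host_name, web.max_check_attempts,
          web.parents, web.check_interval)
```

Reading, inheritance, defaults and string conversion are all driven by
the properties table, so the new check_interval entry needs no other
change.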

== What do I propose ? ==
It's just a Big Bang proposal: I propose that Shinken become the
development branch for Nagios core 4.

I think that with help and testing, we can add everything Nagios does
that Shinken does not, and even more: we have a highly available,
distributed and flexible architecture. We can think of a new way of
getting information: the daemons have an HTTP server included (thanks,
Python), and we can put a REST interface on it for getting information
and sending orders (easier than nagios.cmd, especially on OSes where
there are no named pipes :)).

I know some people will not be happy with it, and I am not asking you
to forget the current C implementation and put the new one in
production within a week. I do not want to fork Nagios. But I will
make Shinken a reality. I would prefer its name to be Nagios4. I will
not let these freaking good ideas of hosts, services, timeperiods,
checks and configuration inheritance become history just because we
cannot evolve like the others.

Darwin's law is against us; let's put it on our side.

== One last killer feature ==
Another good thing about this implementation: it runs everywhere
Python runs, including Windows!! I run Shinken in a Windows Seven VM
with no problem. This can be very useful for SMEs: they are afraid of
installing Linux because they do not have an IT administrator who
knows it. Windows support will allow Nagios to get into such
enterprises.

Nagios usually does mid-range monitoring: it manages IT environments
of 20 to 300 hosts. With this new implementation, it will also easily
manage anything from very small ones to truly huge ones (10000+ hosts
on one node).

So, what now?


Gabès Jean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shinken-architecture.png
Type: image/png
Size: 112205 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20091201/c4a56535/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shinken-architecture-global-realm.png
Type: image/png
Size: 133926 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20091201/c4a56535/attachment-0001.png>
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel