New Nagios implementation proposal

nap naparuba at gmail.com
Wed Dec 9 16:12:57 CET 2009


No other feedback?

Maybe we can start a survey:

What do you think about replacing the current implementation with a
new one based on Shinken in the dev branch for Nagios 4?
1: Stupid, useless and dangerous idea; we can carry on with the current
implementation
2: Worth a look, but it has a very low chance of success
3: Great idea!! Why isn't it already done?
4: Obi Wan Kenobi (you just do not care about this or you just hate
surveys)

Thanks,


Gabès Jean


On Sun, Dec 6, 2009 at 1:26 PM, nap <naparuba at gmail.com> wrote:

> Hi,
>
> My goal is clear: keep the Nagios/Icinga compatibility with a
> faster, more modular implementation (and still fully open source, of
> course).
>
> Instead of going offlist, maybe we can continue on the Icinga devel list?
>
>
> Gabès Jean
>
>
>
> On Sat, Dec 5, 2009 at 3:14 AM, Michael Friedrich <
> michael.friedrich at univie.ac.at> wrote:
>
>>  Hi,
>>
>> very interesting approach :-)
>>
>> Maybe we can talk offlist and in private about your goals and maybe
>> joining forces with Icinga. How about that? :)
>>
>> Kind regards,
>> Michael
>>
>> nap wrote:
>>
>> Hi list,
>>
>> I would like to have your feedback on an (unfinished)
>> reimplementation of Nagios named "Shinken" that I wrote in Python. It
>> is faster and more modular than the current Nagios implementation in C
>> (yes, faster, you read that correctly; I was the first to be surprised).
>>
>> == Shinken's history ==
>> A few months ago, I started to work on a proof of concept for Nagios
>> focused on distributed environments and performance. The main goal was
>> to look for a distributed, high-availability architecture. I also
>> thought that Nagios' performance was quite good, but that we could get
>> more.
>>
>> For quick testing and development, I used Python. I thought a process
>> pool could make Nagios quicker than forking a new process for each
>> check only to kill it a few seconds later. I also bypass Nagios' way
>> of reaping results: reading flat files is just too slow. Instead, the
>> results are a structure that is sent directly to the scheduler. No
>> files, more performance. To be on a par with Nagios, I added the same
>> monitoring logic to the scheduler: HARD/SOFT states, dependencies
>> (parents, servicedep, hostdep, etc.) and database export (Merlin).
>> Shinken uses the standard Nagios configuration files.
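
To make the process pool idea concrete, here is a minimal sketch in
modern Python 3. It is not Shinken's code: the names run_check and
run_checks and the layout of the result dict are assumptions made for
illustration. The workers are created once and reused, and each result
comes back as an in-memory structure instead of a flat file.

```python
import subprocess
from multiprocessing import Pool


def run_check(command):
    """Run one check command and return its result as a plain dict."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return {
        "command": command,
        "exit_status": proc.returncode,
        "output": proc.stdout.strip(),
    }


def run_checks(commands, workers=4):
    """Send checks to a fixed pool of workers and collect the results.

    The pool is created once for the whole batch; the workers are reused
    from one check to the next.
    """
    with Pool(processes=workers) as pool:
        return pool.map(run_check, commands)
```

Calling run_checks with a list of shell commands returns one dict per
command, in the same order. On platforms that start workers with "spawn"
(Windows, for example), the call has to sit under the usual
if __name__ == "__main__": guard.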
>>
>> And the performance is quite good: with Nagios 3, a small check (an
>> echo + exit) and a mid-range server, I reach 10,000 checks in
>> 5 minutes (latency near 1s), and 30K with full tweaks. With my tool, I
>> reach 150K!!
>>
>>
>> == The global architecture ==
>> For the architecture, I think we must follow the Unix way of doing
>> things: one tool per job. For now, Nagios does nearly everything: it
>> reads the configuration, schedules, launches checks and raises
>> notifications. I am trying an architecture where the administrator can
>> have any hosts/services he wants and the daemons are just resources to
>> manage them. The architecture I propose is the following:
>> *Arbiter: a daemon that reads the configuration and cuts it
>> automatically (keeping relations like parents in the same conf) into N
>> confs, where N is the number of schedulers we have. It dispatches the
>> configurations, and also reads the orders in nagios.cmd and dispatches
>> them to the schedulers.
>> *Schedulers: do the scheduling by looking at the states of
>> hosts/services. They only fill queues of checks/notifications/event
>> handlers for the other daemons. Same thing for event broker
>> information: it's just a queue.
>> *pollers: use a process pool, get the checks to launch from the
>> schedulers and return the results to the schedulers.
>> *reactionners: same as pollers, but for notifications and event handlers.
>> *brokers: get event broker information from the schedulers and "do
>> things" with it (like creating the service-perfdata file, or filling
>> databases).
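
The Arbiter's cutting step can be sketched as follows. This is only an
illustration under assumptions of mine, not the actual Shinken
algorithm: hosts linked by a parent relation are grouped into "packs"
(connected components), and each pack goes to the scheduler that
currently has the fewest hosts. The names build_packs and dispatch are
hypothetical.

```python
from collections import defaultdict


def build_packs(parents):
    """Group hosts so that every host shares a pack with its parents.

    `parents` maps a host name to the list of its parent host names.
    """
    # Build an undirected graph of the parent/child relations.
    links = defaultdict(set)
    for host, host_parents in parents.items():
        links[host]  # make sure isolated hosts get an entry too
        for parent in host_parents:
            links[host].add(parent)
            links[parent].add(host)

    packs, seen = [], set()
    for host in sorted(links):
        if host in seen:
            continue
        # Walk the graph to collect everything reachable from this host.
        pack, todo = [], [host]
        seen.add(host)
        while todo:
            current = todo.pop()
            pack.append(current)
            for neighbour in sorted(links[current] - seen):
                seen.add(neighbour)
                todo.append(neighbour)
        packs.append(sorted(pack))
    return packs


def dispatch(packs, nb_schedulers):
    """Give each pack to the scheduler that currently has the fewest hosts."""
    confs = [[] for _ in range(nb_schedulers)]
    for pack in sorted(packs, key=len, reverse=True):
        min(confs, key=len).extend(pack)
    return confs
```

Placing the biggest packs first is a simple greedy choice to keep the N
confs roughly balanced; a pack is never split, so a host always lands
with its parents.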
>>
>> The way the pollers work is like DNX, nothing new here. The
>> reactionners allow the administrator to have a single daemon that sends
>> all the notifications of all his schedulers (useful for SMTP
>> authorizations, or for filling a single RSS file with all
>> notifications). The schedulers do not launch checks, so they do not
>> gain latency when notifications or event handlers are launched.
>>
>> The load balancing is automatic: the Arbiter cuts the conf and
>> dispatches the parts. For high availability, there can be spare
>> daemons: if a daemon dies, another takes its configuration (the Arbiter
>> "pings" the daemons, and if a daemon fails, it just sends the
>> configuration to a spare). The daemons are reached over the network, so
>> all daemons can be on different servers (and it's better for high
>> availability not to put all daemons on the same server :) ). For now,
>> the Arbiter does not have a spare, but one will be added in the future.
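
The spare mechanism can be pictured with the small sketch below. It is
a simplification with made-up names, not Shinken's implementation:
daemons are plain dicts, and is_alive is a callable standing in for the
Arbiter's "ping".

```python
def reassign_dead_daemons(daemons, is_alive):
    """Move the configuration of each dead daemon to an idle spare.

    `daemons` is a list of dicts with the keys "name", "spare" and "conf";
    `is_alive` is a callable that plays the role of the Arbiter's ping.
    """
    idle_spares = [d for d in daemons
                   if d["spare"] and d["conf"] is None and is_alive(d)]
    for daemon in daemons:
        if daemon["conf"] is None or is_alive(daemon):
            continue  # nothing to rescue, or the daemon is fine
        if not idle_spares:
            break  # no spare left to take over
        spare = idle_spares.pop(0)
        # The spare takes the configuration; the dead daemon loses it.
        spare["conf"], daemon["conf"] = daemon["conf"], None
    return daemons
```

If there are more dead daemons than spares, the remaining
configurations stay unassigned in this sketch; a real Arbiter would
have to report that.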
>>
>> You can see this architecture in the file shinken-architecture.png.
>>
>> If the user configuration does not define such daemons, Shinken
>> automatically creates default ones (on localhost with default ports).
>>
>> == Advanced architecture ==
>> In the architecture we saw, all reactionners/pollers/brokers take
>> orders from ALL schedulers. This can be a problem with reactionners
>> (with 3 SMTP servers (USA, Europe, Asia), it's hard to force Asia
>> notifications to go through the Asia SMTP server). Same for pollers: a
>> poller polls for checks to run, and getting checks from a very distant
>> scheduler can be very slow.
>> To manage this, Shinken uses a way of cutting up the architecture: realms.
>>
>> A realm is a pool of daemons that work together. A host is tagged with
>> a realm (and only one), so it will be managed by this realm's
>> schedulers/pollers/reactionners/brokers. A realm can have sub-realms,
>> so you can put a reactionner in the higher realm and it will manage
>> all the schedulers of the sub-realms. A picture is worth a thousand
>> words: you can get a better look at what a realm is in the file
>> shinken-architecture-global-realm.png.
>>
>> As with daemons: if the user configuration does not define a realm, a
>> default one is created by Shinken.
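
A realm tree can be sketched like this. The Realm class and its
attribute names are assumptions for illustration only; the point is
that a daemon placed in a higher realm sees the schedulers of that
realm and of every sub-realm below it.

```python
class Realm:
    """A pool of daemons; sub-realms are nested under a parent realm."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.schedulers = []
        self.reactionners = []
        self.sub_realms = []
        if parent is not None:
            parent.sub_realms.append(self)

    def all_schedulers(self):
        """Schedulers of this realm and of every realm below it."""
        found = list(self.schedulers)
        for sub in self.sub_realms:
            found.extend(sub.all_schedulers())
        return found
```

With a "World" realm holding "Europe" and "Asia" as sub-realms, a
reactionner attached to "World" would serve the result of
world.all_schedulers(), while one attached to "Asia" would only serve
asia.all_schedulers().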
>>
>> == What is not managed by Shinken? ==
>> A lot of stuff! But the most important items are regexp configurations,
>> inherits_parents for host/service dependencies (always 1 in Shinken)
>> and notification escalations. It also does not have exclude timeperiod
>> support (like Nagios, in fact ;) )
>>
>> The current implementation doc is at
>> http://wiki.nagios-fr.org/nagios/shinken/start in French. I am writing
>> the English documentation, and English will be its primary language in
>> the future.
>>
>> == What is managed? ==
>> All the classic stuff is managed (SOFT/HARD, complex inheritance,
>> volatile services, freshness, timeperiods without exclude, flapping
>> states...). It also has NDO and Merlin database support with MySQL,
>> and NDO support with Oracle (yes, like Icinga)!! The NDO support
>> is not complete: some objects are not managed (like notifications), but
>> it's not difficult to add them. It also supports UTF-8 names.
>>
>>
>> == How do I test this freaking tool? ==
>> Just get the VirtualBox VM at http://www.megaupload.com/?d=57BGSL09
>> (yes, there can be a legal file on megaupload :) ). It's in OVF format,
>> so you need to import it with VirtualBox.
>> It's an Ubuntu server with a DHCP NIC; the account is shinken/shinken.
>> You can launch all daemons with:
>> ./launch_all.sh
>> and kill them all with:
>> ./stop_all.sh
>>
>> Look at the small README file to see how to watch the output of the
>> daemons (tail -f the debug files). The current configuration is quite
>> small (1500 services), so it will run with no problem. You have a Ninja
>> interface at http://IP_OF_THE_VM/ninja with monitor/monitor to watch
>> the work.
>> Warning: Ninja does not seem to see more than one instance_id in the
>> database, so you will see only half of the hosts/services. You can
>> remove one of the schedulers in etc/conf.cfg: all hosts will be added
>> automatically to the last active scheduler :)
>>
>> You can test your current Nagios conf with Shinken. It will create
>> the daemon configuration if needed.
>>
>>
>> If you want to install it from scratch, it's not so difficult.
>> Shinken just needs:
>> *python-2.6
>> *pyro (a Python module like CORBA)
>> *python-graph-core (on Ubuntu: sudo apt-get install python-setuptools
>> && sudo easy_install python-graph-core). I will drop this dependency
>> soon (I only use it for a loop check, so a module for it is just too
>> much...)
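
For reference, a loop check on parent relations does not need a graph
library. The sketch below is one possible way to do it (a depth-first
search that reports the first loop it finds), written in modern
Python 3; find_loop is a hypothetical name and this is not the code
Shinken uses.

```python
def find_loop(parents):
    """Return one parent loop as a list of host names, or None.

    `parents` maps a host name to the list of its parent host names.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, done
    state = {host: WHITE for host in parents}

    def visit(host, path):
        state[host] = GREY
        path.append(host)
        for parent in parents.get(host, []):
            if state.get(parent, WHITE) == GREY:
                # We came back to a host still in progress: a loop.
                return path[path.index(parent):]
            if state.get(parent, WHITE) == WHITE:
                loop = visit(parent, path)
                if loop:
                    return loop
        path.pop()
        state[host] = BLACK
        return None

    for host in parents:
        if state[host] == WHITE:
            loop = visit(host, [])
            if loop:
                return loop
    return None
```

The recursion is fine for a sketch; very deep parent chains would call
for an iterative version.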
>>
>> You can get the code with:
>> git clone git://shinken.git.sourceforge.net/gitroot/shinken/shinken
>>
>> Remember to change etc/nagios.cfg and etc/conf.cfg to match your
>> directory and, optionally, the "plugin" object in conf.cfg to set your
>> NDO or Merlin database user/pass/database. You then just need to launch
>> the following in shinken/src (here in 5 shells, not daemonized, for
>> easy testing):
>> python shinken-scheduler.py
>> python shinken-poller.py
>> python shinken-reactionner.py
>> python shinken-broker.py
>> python shinken-arbiter -c etc/nagios.cfg
>>
>> == And now? ==
>> The proof of concept became a new implementation: it's now easier to
>> add the missing Nagios features to Shinken than to port Shinken's
>> features to the current Nagios.
>>
>> I tried to talk about this new implementation directly to some people
>> on this list, but they did not seem very keen on it. I easily
>> understand: the process pool alone is hard work in C (and we cannot
>> take the Apache code for it, not the right licence :( ) and it would
>> change a lot of Nagios internals. Replacing the reaping process with a
>> socket is quite hard too.
>>
>> Yes, it breaks nearly everything, I know. It's not binary compatible
>> with event broker modules (Merlin, NDO, Livestatus), but I think
>> Nagios must evolve quicker than it currently does. Zenoss's evolution
>> is very impressive. The current Nagios implementation in C is good (it
>> has done the job for the last 10 years!!). But as with replacing the
>> old CGI interface with PHP (Ninja in fact, because the new Nagios XI
>> interface is just not open source at all), we must keep all the ideas
>> that make Nagios what it is (hosts, services, configuration with
>> inheritance, timeperiods) and put them in a new tool written in a
>> high-level language.
>>
>> I think C is not always the right language for tools. If we are afraid
>> of making a new architecture just because managing sockets/IPC is too
>> hard, we must change the language.
>> If the idea of replacing the old fork/fork/reaper way with a new one
>> based on a process pool and results returned directly in memory gives
>> you nightmares, we must change the language.
>> If the idea of Zenoss becoming the new reference among OSS monitoring
>> tools gives you even worse nightmares, we must evolve quicker, so we
>> must change the language.
>>
>> An example: to add a new property to a Nagios object in the
>> current C code, we must add it in numerous files (config file reading,
>> object creation and so on). With a higher-level language like Python,
>> it needs just ONE line and everything else is handled afterwards
>> (inheritance, object creation, default value, conversion from a string
>> to a real value like an int or a list of values).
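
Here is a rough sketch of what such a one-line property declaration
could look like. The Host class, the properties table and the to_int /
to_list helpers are made up for this example and do not reproduce
Shinken's real classes; inheritance is reduced to a single template
dict.

```python
def to_int(value):
    """Convert a configuration string to an integer."""
    return int(value)


def to_list(value):
    """Convert a comma-separated configuration string to a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


class Host:
    # One entry per property: (default value as a string, converter).
    properties = {
        "host_name": (None, str),
        "max_check_attempts": ("3", to_int),
        "parents": ("", to_list),
        "notes": ("", str),  # a new property costs this single line
    }

    def __init__(self, raw, template=None):
        """Build a host from raw string values, a template, then defaults."""
        for name, (default, convert) in self.properties.items():
            if name in raw:
                value = raw[name]
            elif template is not None and name in template:
                value = template[name]  # inherited value
            else:
                value = default
            setattr(self, name, None if value is None else convert(value))
```

The generic loop in __init__ is written once; lookup order, default
values and string conversion then apply to any property added to the
table.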
>>
>> == What do I propose? ==
>> It's just a Big Bang proposal: I propose that Shinken become the
>> development branch for Nagios core 4.
>>
>> I think that with help and testing, we can add everything that Nagios
>> does and Shinken does not, and even more: we have a highly available,
>> distributed and flexible architecture. We can think of a new way of
>> getting information: the daemons have an HTTP server included (thanks,
>> Python) and we can add a REST interface for getting information and
>> sending orders (easier than nagios.cmd, especially on OSes where there
>> are no named pipes :)).
>>
>> I know some people will not be happy with it, and I am not asking to
>> forget the current C implementation and put the new one into
>> production within a week. I do not want to fork Nagios. But I will make
>> Shinken a reality. I would prefer its name to be Nagios 4. I will not
>> let these freaking good ideas of hosts, services, timeperiods, checks
>> and configuration inheritance become history just because we cannot
>> evolve like the others.
>>
>> Darwin's law is against us; let's put it on our side.
>>
>> == One last killer feature ==
>> One other good thing about this implementation: it runs
>> everywhere Python runs, including Windows!! I run Shinken in a
>> Windows 7 VM with no problem. This can be very useful for SMEs: they
>> are afraid of installing Linux because they do not have an IT
>> administrator who knows it. Windows support will allow
>> Nagios to enter such enterprises.
>>
>> Nagios usually does mid-range monitoring: it manages IT of 20 to
>> 300 hosts. With this new implementation, it will also easily manage
>> everything from very small setups to truly huge ones (10000+ hosts on
>> one node).
>>
>> So, what now?
>>
>>
>> Gabès Jean
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Join us December 9, 2009 for the Red Hat Virtual Experience,
>> a free event focused on virtualization and cloud computing.
>> Attend in-depth sessions from your desk. Your couch. Anywhere.
>> http://p.sf.net/sfu/redhat-sfdev2dev
>> _______________________________________________
>> Nagios-devel mailing list
>> Nagios-devel at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 133926 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20091209/d1a1cd9f/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 112205 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20091209/d1a1cd9f/attachment-0001.png>