New Nagios implementation proposal

nap naparuba at gmail.com
Wed Dec 2 08:00:45 CET 2009


Hi,

You are right. I also tested with my real production configuration (7000
services), and the load average of the server was 5 times lower than on my
Nagios server. I also tried a benchmark with an "echo; sleep 0.5; exit"
check, and the performance was still very high. I will try the
"random 1 -> 10".

In the real world the performance will be lower than in my ideal test
environment, but at least it will be limited by the checks we launch,
not by the scheduler.
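
The random-length sleep check suggested in the quoted message below could look
roughly like this. It is only a sketch: the name random_sleep_check, its
parameters and the 1 to 10 second bounds are illustrative and not part of
Shinken or Nagios. It waits a random time and then reports OK with exit
code 0, the usual Nagios plugin convention.

```python
import random
import time


def random_sleep_check(min_s=1.0, max_s=10.0, sleep=time.sleep, rng=random):
    """Simulate a slow check: wait a random time, then report OK.

    Returns (exit_code, output); exit code 0 means OK for a Nagios plugin.
    `sleep` and `rng` are injectable so the function can be tried quickly.
    """
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return 0, "OK - slept %.2f seconds" % delay


def main():
    """Plugin entry point: print the output line and return the exit code."""
    code, output = random_sleep_check()
    print(output)
    return code
```

Saved as a script that ends with sys.exit(main()), this could be used as the
check command for such a benchmark.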



Gabès Jean

On Tue, Dec 1, 2009 at 9:43 PM, Martin Melin <mmelin at gmail.com> wrote:
> I don't want to sound negative, but the performance of your clone with
> "echo; exit;" checks isn't really interesting. In a real-world Nagios
> installation, checks take a *long* time to execute because they are
> dependent on so many factors external to the local machine.
>
> To get more realistic numbers, I would suggest using a plugin that does a
> random-length sleep, between 0 and 10 seconds long.
>
> The reason why your numbers are not realistic is because you're testing for
> the wrong thing: in your test environment, the system spends most of its
> time spawning and handling processes, reporting back data etc. Of course
> you'll get better performance by drastically reducing the total population
> of processes on the machine. But in a real-world Nagios environment, there
> will be a lot of time spent waiting on the network & remote machines - this
> time is simply not present in your test case.
>
> This is not to say that I don't believe that your solution may be an
> improvement or have better performance - it is saying that nobody knows,
> because your test parameters are not anything like the real world.
>
> All the best and looking forward to seeing your code,
>
> Martin Melin
>
> On Tue, Dec 1, 2009 at 8:07 PM, William Leibzon <william at leibzon.org> wrote:
>>
>> Go and program it, but don't hope that this will ever become nagios4 -
>> the nagios name is a trademark and it's entirely up to Ethan what it gets
>> used for. But that does not mean it's bad to have a different, compatible
>> implementation (as far as config, database and frontend go).
>>
>> Nagios has a very long history, so the chances of a rewrite in a different
>> language are very low, although I kind of hope people would open up to at
>> least C++. As a language for something like nagios, but new, python is not
>> bad, but it's rather inflexible in how it forces you to write (both good and
>> bad, don't start a pointless discussion here) and does not give you quite as
>> much flexibility for memory and a few other ops. Also, looking at your
>> architecture, I immediately see that the best language for it is probably
>> Erlang, but then finding people to support and develop it further would be a
>> lot more difficult. So if you want to do it in python, go for it; just don't
>> hope that it would become nagios4, but do report back on your progress from
>> time to time if you like (this would call for plugins written in python too,
>> and the preferred language is actually perl). Personally, if I were to
>> consider rewriting nagios in an interpreted language, I'd wait for Perl6 to
>> come out; it'd be taken a lot better by nagios users.
>>
>> On Tue, Dec 1, 2009 at 9:09 AM, nap <naparuba at gmail.com> wrote:
>>>
>>> Hi list,
>>>
>>> I would like to have your feedback about an (unfinished)
>>> reimplementation of Nagios named "Shinken" that I wrote in Python. It is
>>> faster and more modular than the current Nagios implementation in C
>>> (yes, faster, you read correctly. I was the first to be surprised by that).
>>>
>>> == Shinken's history ==
>>> A few months ago, I started to work on a proof of concept for Nagios
>>> focused on distributed environments and performance. The main goal was
>>> to look for a distributed and highly available architecture. I also
>>> thought that Nagios' performance was quite good, but that we could get
>>> more.
>>>
>>> For quick tests and development, I used Python. I thought a process
>>> pool could make Nagios quicker than forking a new process for each
>>> check only to kill it a few seconds later. I also bypass the Nagios
>>> way of reaping results: reading flat files is just too slow. Instead,
>>> each result is a structure that is sent directly to the scheduler. No
>>> files, more performance. To be on par with Nagios, I added the same
>>> monitoring logic to the scheduler: HARD/SOFT states, dependencies
>>> (parents, servicedep, hostdep, etc.) and database export (Merlin).
>>> Shinken uses the standard Nagios conf files.
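
The idea of long-lived workers that hand results back as in-memory
structures can be sketched as follows. This is not Shinken's code: the
names worker and run_checks are made up for the illustration, and threads
stand in for the pool of worker processes to keep the sketch short. Each
check still runs as an external command, but the workers themselves are
started once, and results travel through a queue rather than through flat
files.

```python
import queue
import subprocess
import sys
import threading


def worker(checks, results):
    """Long-lived worker: take checks from a queue, push result dicts back."""
    while True:
        item = checks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        check_id, command = item
        proc = subprocess.run(command, capture_output=True, text=True)
        # The result is a plain in-memory structure, never written to disk.
        results.put({"id": check_id,
                     "exit_code": proc.returncode,
                     "output": proc.stdout.strip()})


def run_checks(commands, n_workers=4):
    """Run every command (a list of argv lists) through a fixed worker pool."""
    checks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(checks, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for check_id, command in enumerate(commands):
        checks.put((check_id, command))
    for _ in threads:
        checks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted((results.get() for _ in commands), key=lambda r: r["id"])
```

For example, run_checks([[sys.executable, "-c", "print('OK')"]] * 3) returns
three result dictionaries, one per check, ordered by id.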
>>>
>>> And the performance is quite good: with Nagios 3, a small check (an
>>> echo + exit) and a mid-range server, I run 10000 checks in
>>> 5 minutes (latency near 1s), 30K with full tweaks. With my tool, I run
>>> 150K!!
>>>
>>>
>>> == The global architecture ==
>>> For the architecture, I think we must use the Unix way of doing things:
>>> one tool per task. For now, Nagios does nearly everything: it reads the
>>> conf, schedules, launches checks and raises notifications. I am trying an
>>> architecture where the administrator can have any hosts/services he
>>> wants and the daemons are just resources to manage them. The
>>> architecture I propose is the following:
>>> *Arbiter: a daemon that reads the configuration and automatically cuts it
>>> into N confs (keeping relations like parents in the same conf), where N
>>> is the number of schedulers we have. It dispatches the configuration, and
>>> also reads the orders in nagios.cmd and dispatches them to the schedulers.
>>> *Schedulers: do the scheduling by looking at the states of
>>> hosts/services. They just fill queues of checks/notifications/event
>>> handlers for the other daemons. Same thing for event broker information:
>>> it's just a queue.
>>> *Pollers: use a process pool, get the checks to launch from the
>>> schedulers and return the results to the schedulers.
>>> *Reactionners: same as pollers, but for notifications and event
>>> handlers.
>>> *Brokers: get event broker information from the schedulers and "do
>>> things" with it (like creating the service-perfdata file, or filling
>>> databases).
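
The post does not say how the Arbiter cuts the configuration. One possible
approach, sketched here under that assumption, is to group hosts linked by
parent relations (connected components) and then hand each group to the
least loaded conf. The function names related_groups and split_conf and the
sample host names are hypothetical.

```python
from collections import defaultdict


def related_groups(parents):
    """Group hosts linked by parent relations (connected components).

    `parents` maps each host name to the list of its parent host names.
    """
    neighbours = defaultdict(set)
    for host, host_parents in parents.items():
        neighbours[host]  # make sure hosts without relations appear too
        for parent in host_parents:
            neighbours[host].add(parent)
            neighbours[parent].add(host)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        group, todo = [], [start]
        while todo:
            host = todo.pop()
            group.append(host)
            for other in neighbours[host]:
                if other not in seen:
                    seen.add(other)
                    todo.append(other)
        groups.append(sorted(group))
    return groups


def split_conf(parents, n_schedulers):
    """Spread the groups over N confs, always filling the smallest conf."""
    confs = [[] for _ in range(n_schedulers)]
    for group in sorted(related_groups(parents), key=len, reverse=True):
        min(confs, key=len).extend(group)
    return confs
```

With two schedulers, a router with two web servers behind it and an
unrelated pair of database hosts end up in separate confs, and no parent
relation is split between them.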
>>>
>>> The pollers work like DNX, nothing new here. The
>>> reactionners allow the administrator to have a single daemon that sends
>>> all the notifications of all his schedulers (useful for SMTP
>>> authorizations or for filling a single RSS file with all
>>> notifications). The schedulers do not launch checks, so they do not
>>> get latency when notifications or event handlers are launched.
>>>
>>> The load balancing is automatic: the arbiter cuts the conf and
>>> dispatches the parts. For high availability, there can be spare daemons:
>>> if a daemon dies, another one takes its configuration (the Arbiter "pings"
>>> the daemons, and if a daemon fails, it just sends the configuration to a
>>> spare). The daemons are reached over the network, so all daemons can be on
>>> different servers (and it's better for high availability not to put
>>> all daemons on the same server :) ). For now, the Arbiter does not have
>>> a spare, but one will be added in the future.
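
The spare mechanism described above can be pictured with a small sketch.
The Daemon class and the reassign_dead function are hypothetical names, and
the `alive` flag stands for the outcome of the Arbiter's last "ping"; how
Shinken really tracks this is not stated in the post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Daemon:
    name: str
    alive: bool = True          # outcome of the last "ping"
    spare: bool = False
    conf: Optional[str] = None  # configuration currently assigned


def reassign_dead(daemons):
    """Move the conf of every dead daemon to an idle, living spare."""
    moved = []
    for dead in daemons:
        if dead.alive or dead.conf is None:
            continue
        spare = next((d for d in daemons
                      if d.spare and d.alive and d.conf is None), None)
        if spare is None:
            break  # no spare left: the remaining confs stay unmanaged
        spare.conf, dead.conf = dead.conf, None
        moved.append((dead.name, spare.name))
    return moved
```

Calling reassign_dead after each round of pings returns the list of
(dead daemon, spare) pairs for which a configuration was moved.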
>>>
>>> You can see this Architecture in the file shinken-architecture.png.
>>>
>>> If the user configuration does not define such daemons, Shinken
>>> automatically creates default ones (on localhost with default ports).
>>>
>>> == Advanced architecture ==
>>> In the architecture we saw, all reactionners/pollers/brokers take
>>> orders from ALL schedulers. This can be a problem for reactionners
>>> (with 3 SMTP servers (USA, Europe, Asia), it's hard to force Asia
>>> notifications to go through the Asia SMTP server). Same for pollers: a
>>> poller polls for checks to run, and getting checks from a very distant
>>> scheduler can be very slow.
>>> To manage this, Shinken uses a way of cutting the architecture: Realms.
>>>
>>> A realm is a pool of daemons that work together. A host is tagged with a
>>> realm (and only one), so it will be managed by this realm's
>>> schedulers/pollers/reactionners/brokers. A realm can have sub-realms,
>>> so you can put a reactionner in the higher realm and it will manage
>>> all the schedulers of the sub-realms. A picture is worth a thousand
>>> words: you can get a better look at what a realm is in the file
>>> shinken-architecture-global-realm.png.
>>>
>>> As for daemons: if the user configuration does not define a realm, a
>>> default one is created by Shinken.
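
The realm tree can be sketched as a small class, to show how a reactionner
placed in a higher realm ends up serving the schedulers of all sub-realms.
The Realm class, its attributes and the daemon names are hypothetical and
only illustrate the description above.

```python
class Realm:
    """A pool of daemons, optionally nested inside a parent realm."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.schedulers = []
        self.reactionners = []
        self.sub_realms = []
        if parent is not None:
            parent.sub_realms.append(self)

    def all_schedulers(self):
        """Schedulers of this realm and of every sub-realm below it."""
        found = list(self.schedulers)
        for sub in self.sub_realms:
            found.extend(sub.all_schedulers())
        return found


world = Realm("World")
asia = Realm("Asia", parent=world)
europe = Realm("Europe", parent=world)
asia.schedulers.append("scheduler-asia")
europe.schedulers.append("scheduler-europe")
world.reactionners.append("reactionner-world")
```

Here a reactionner attached to "World" would take orders from
world.all_schedulers(), that is both schedulers, while one attached to
"Asia" would only see asia.all_schedulers().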
>>>
>>> == What is not managed by Shinken? ==
>>> A lot of stuff! But the most important are regexp configurations,
>>> inherits_parent of host/service dependencies (always 1 in Shinken)
>>> and notification escalations. It also does not have exclude timeperiod
>>> support (like Nagios, in fact ;) )
>>>
>>> The current implementation doc is at
>>> http://wiki.nagios-fr.org/nagios/shinken/start in French. I am writing
>>> the English documentation, and English will be its primary language in
>>> the future.
>>>
>>> == What is managed? ==
>>> All the classic stuff is managed (SOFT/HARD, complex inheritance,
>>> volatile services, freshness, timeperiods with no exclude, flapping
>>> states...). It also has NDO and Merlin database support in MySQL, and
>>> NDO support with Oracle (yes, like Icinga)!! The NDO support
>>> is not complete: some objects are not managed (like notifications), but
>>> it's not difficult to add them. It also supports UTF8 names.
>>>
>>>
>>> == How do I test this freaking tool? ==
>>> Just get the VirtualBox VM at http://www.megaupload.com/?d=57BGSL09
>>> (yes, there can be a legal file on megaupload :) ). It's in OVF format,
>>> so you need to import it with VirtualBox.
>>> It's an Ubuntu server with a DHCP NIC; the account is shinken/shinken.
>>> You can launch all daemons with:
>>> ./launch_all.sh
>>> and kill them all with:
>>> ./stop_all.sh
>>>
>>> Look at the small README file to see how to watch the output of the
>>> daemons (tail -f debug files). The current configuration is quite small
>>> (1500 services), so it will run with no problem. You have a Ninja interface
>>> at http://IP_OF_THE_VM/ninja with monitor/monitor to watch the work.
>>> Warning: Ninja does not seem to see more than one instance_id in the
>>> database, so you will see only half of the hosts/services. You can remove
>>> one of the schedulers in etc/conf.cfg: all hosts will then be added
>>> automatically to the last active scheduler :)
>>>
>>> You can test your current Nagios conf with Shinken. It will create
>>> the daemons configuration if needed.
>>>
>>>
>>> If you want to install it from scratch, it's not so difficult.
>>> Shinken just needs:
>>> *python-2.6
>>> *pyro (a Python module like CORBA)
>>> *python-graph-core (on Ubuntu: sudo apt-get install python-setuptools
>>> && sudo easy_install python-graph-core). I will drop this dependency
>>> soon (I just use it for a loop check, so a whole module for it is just
>>> too much...)
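
A loop check on parent relations needs no external module. The sketch below
is a plain depth-first search, not Shinken's actual code; the function name
find_parent_loop and the dictionary layout are assumptions made for the
example.

```python
def find_parent_loop(parents):
    """Return one loop of host names as a list, or None if there is none.

    `parents` maps each host name to the list of its parent host names.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, being visited, finished
    state = {}

    def visit(host, path):
        state[host] = GREY
        path.append(host)
        for parent in parents.get(host, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY:  # back on the current path: a loop
                return path[path.index(parent):] + [parent]
            if colour == WHITE:
                loop = visit(parent, path)
                if loop:
                    return loop
        path.pop()
        state[host] = BLACK
        return None

    for host in parents:
        if state.get(host, WHITE) == WHITE:
            loop = visit(host, [])
            if loop:
                return loop
    return None
```

The search is recursive, so a very deep chain of parents would need an
iterative variant to stay under Python's recursion limit.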
>>>
>>> You can get the code with :
>>> git clone git://shinken.git.sourceforge.net/gitroot/shinken/shinken
>>>
>>> Remember to change etc/nagios.cfg and etc/conf.cfg to match your directory
>>> and, optionally, the "plugin" object in conf.cfg to set your ndo or
>>> merlin database user/pass/database. You then just need to launch, from
>>> shinken/src (here with 5 shells, not daemonized, for easy testing):
>>> python shinken-scheduler.py
>>> python shinken-poller.py
>>> python shinken-reactionner.py
>>> python shinken-broker.py
>>> python shinken-arbiter -c etc/nagios.cfg
>>>
>>> == And now? ==
>>> The proof of concept became a new implementation: it's now easier to
>>> add the missing features of Nagios to Shinken than to port the features
>>> of Shinken to the current Nagios.
>>>
>>> I tried to talk about this new implementation directly with some people
>>> on this list, but they did not seem very keen on it. I understand
>>> easily: just the process pool is hard work in C (and we cannot
>>> take the Apache code for it, not the right licence :( ) and it would
>>> change a lot of Nagios internals. Replacing the reaping process with a
>>> socket is quite hard too.
>>>
>>> Yes, it breaks nearly everything, I know. It's not binary compatible
>>> with event broker modules (merlin, ndo, live status), but I think
>>> Nagios must evolve quicker than it currently does. Zenoss's evolution
>>> is very impressive. The current Nagios implementation in C is good (it
>>> has done the job for the last 10 years!!). But like the replacement of
>>> the old CGI interface with PHP (Ninja in fact, because the new Nagios XI
>>> interface is just not open source at all), we must keep all the ideas of
>>> what Nagios is (hosts, services, configuration with inheritance,
>>> timeperiods) and put them in a new tool written in a high-level language.
>>>
>>> I think C is not always the right language for tools. If we are afraid
>>> of making a new architecture just because managing sockets/IPC is too
>>> hard, we must change the language.
>>> If the idea of replacing the old fork/fork/reaper way with a new one
>>> based on a process pool and direct in-memory returns gives you
>>> nightmares, we must change the language.
>>> If the idea of Zenoss becoming the new reference in OSS monitoring tools
>>> gives you even worse nightmares, we must evolve quicker, so we
>>> must change the language.
>>>
>>> An example: to add a new property to a Nagios object in the
>>> current C code, we must add it in numerous files (config file reading,
>>> object creation and so on). With a higher-level language like Python, it
>>> just needs ONE line and everything else is managed from there
>>> (inheritance, object creation, default value, transformation from string
>>> to a real value like an int or a list of values).
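
To make the "ONE line" claim concrete, here is a minimal sketch of a
declarative property table. It is not Shinken's code: the Item and Service
classes, the converter helpers and the default values are all invented for
the example. Each entry gives a default (written as it would appear in a
config file) and a converter, and the generic constructor handles template
inheritance, defaults and string conversion.

```python
def to_bool(value):
    """Convert the config string "1"/"0" to a boolean."""
    return value.strip() == "1"


def to_list(value):
    """Convert a comma-separated config string to a list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


class Item:
    properties = {}  # name -> (default as config string, converter)

    def __init__(self, raw, template=None):
        for name, (default, convert) in self.properties.items():
            if name in raw:                 # set explicitly on the object
                value = convert(raw[name])
            elif template is not None:      # inherited from the template
                value = getattr(template, name)
            else:                           # fall back to the default
                value = convert(default)
            setattr(self, name, value)


class Service(Item):
    properties = {
        "max_check_attempts": ("3", int),
        "is_volatile": ("0", to_bool),
        "contacts": ("", to_list),
        # Supporting a new property is one more line in this table:
        "check_interval": ("5", int),
    }
```

For instance, Service({"is_volatile": "1"}, template=generic) takes
is_volatile from its own definition and every other value from the
template object `generic`.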
>>>
>>> == What do I propose? ==
>>> It's just a Big Bang proposal: I propose that Shinken be the
>>> development branch for Nagios core 4.
>>>
>>> I think that with help and tests, we can add everything that Nagios does
>>> and Shinken does not yet do, and even more: we have a highly available,
>>> distributed and flexible architecture. We can think of a new way of
>>> getting information: the daemons have an HTTP server included (thanks,
>>> Python), and we can put a REST interface on it for getting information
>>> and sending orders (easier than nagios.cmd, especially on OSes where
>>> there are no named pipes :)).
>>>
>>> I know some people will not be happy with it, and I am not asking anyone
>>> to forget the current C implementation and put the new one in production
>>> in one week. I do not want to fork Nagios. But I will make Shinken a
>>> reality. I would prefer its name to be Nagios4. I will not allow these
>>> freaking good ideas of hosts, services, timeperiods, checks and
>>> configuration inheritance to become history just because we cannot evolve
>>> like the others.
>>>
>>> Darwin's law is against us; let's put it on our side.
>>>
>>> == One last killer feature ==
>>> One other good thing about this implementation: it runs
>>> everywhere Python runs, including Windows!! I run Shinken in a
>>> Windows 7 VM with no problem. It can be very useful for SMEs: they are
>>> afraid of installing Linux because they do not have an IT
>>> administrator who knows it. Windows support will allow
>>> Nagios to enter such enterprises.
>>>
>>> Nagios usually does mid-range monitoring: it manages IT from 20 to
>>> 300 hosts. With this new implementation, it will also easily manage
>>> everything from very small setups to truly huge ones (10000+ hosts in
>>> one node).
>>>
>>> So, what now?
>>>
>>>
>>> Gabès Jean
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Join us December 9, 2009 for the Red Hat Virtual Experience,
>>> a free event focused on virtualization and cloud computing.
>>> Attend in-depth sessions from your desk. Your couch. Anywhere.
>>> http://p.sf.net/sfu/redhat-sfdev2dev
>>> _______________________________________________
>>> Nagios-devel mailing list
>>> Nagios-devel at lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>>
>>
>>
>>
>>
>
>
>
>
