New Nagios implementation proposal

nap naparuba at gmail.com
Tue Dec 15 10:53:58 CET 2009


On Mon, Dec 14, 2009 at 1:37 PM, Andreas Ericsson <ae at op5.se> wrote:
> On 12/11/2009 04:30 PM, nap wrote:
>> On Fri, Dec 11, 2009 at 1:53 PM, Andreas Ericsson<ae at op5.se>  wrote:
>>
>>>
>>> Process pools aren't that hard to do in C really, but altering the
>>> entire concept of how Nagios operates is a fairly big change. OTOH, I'm
>>> not thrilled about the whole "check-results are stored in tempfiles"
>>> thing either, and *that* was a major change too.
>> Maybe we can first work on the "return in socket/memory" part before
>> trying the process pool. It should be easier and could have a very big effect.
>>
>
> That would be easier, yes. I once did a test of multiplexing check
> results and had very good results with it. The only problem is that
> it would require a double-fork() now, as checks would have to be
> wrapped in something to provide correct output with the microsecond
> execution time precision Nagios currently uses.

I don't understand the double-fork problem: instead of writing a flat
file, the child that popen()s the check just opens a socket to the Nagios
main process. Instead of micro-sleeping, Nagios select()s on the socket
(with a timeout instead of the sleep). It can then queue the result for
reaping, or maybe reap it directly.
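
To make the idea concrete, here is a minimal Python sketch of that flow
(the socket path, buffer size and "one result per connection" framing are
placeholders for the example, not what Nagios or Shinken actually does):

import os
import select
import socket

SOCK_PATH = '/tmp/check-results.sock'  # hypothetical path

# --- main process side ---
if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
server.listen(64)

results_queue = []  # results waiting to be reaped by the scheduler

def reap_results(timeout=0.5):
    # wait up to `timeout` seconds for results instead of micro-sleeping
    readable, _, _ = select.select([server], [], [], timeout)
    for s in readable:
        conn, _ = s.accept()
        data = conn.recv(8192)      # one check result per connection
        conn.close()
        results_queue.append(data)

# --- child (check runner) side ---
def send_result(raw_result):
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(SOCK_PATH)
    client.sendall(raw_result)
    client.close()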

>
>>>
>>> Jean, let's discuss how we can move this forward within the C-code
>>> in such a way that we retain compatibility on all levels. Too many
>>> have invested too much in Merlin, NDOUtils and other C-based addons
>>> to relinquish them easily, and splitting the community again would
>>> be really, really stupid.
>> I agree with that. But I also think that in the coming years we cannot
>> avoid a refactoring in order to use new tools like distributed object
>> technologies or dynamic development (you create properties for your
>> objects, so you cut out a large part of your code). I know we can make
>> great things in C. We will make great things in C for V4. But we must
>> think about long-term development too.
>>
>
> Well, we could probably rewrite Nagios from scratch in a lot less than
> a year. Like most great things, it's not the implementation that's so
> spectacular but the idea behind it that is brilliant.
Yes

>
> I have no idea what you mean by "dynamic development". It's a hypeterm
> that can mean anything from "we let quality fluctuate wildly" to "we
> never really know what features the next release will hold". It's
> hardly ever anything good anyways.
<Warning> Python code just below :) </warning>

Believe me, I do not use this term in a marketing way. I just HATE
marketing: you think you are buying the best tool in the world, and in
fact it does nothing of what the guy who sold it to you said. Here
"dynamic" is not about the dynamics of the project or anything like that.
It is just Python's capacity for code introspection.

You can "attach" arrays in classes. You can also access class of an
object just by object.__class__. I use this in the macro resolver
part. I use one function to resolv a command, it take the command line
(with macros) and a list of object. It just do not care about with
object it is, it can be host, service, contact or whatever you want.
Let called this list "the context". Importants classes like hosts,
services or contact have a macros arrays : it list available macros
for the type and for each macro the property of the object that have
the information. For host we've got for example :
macros = {
    'HOSTADDRESS': 'address',
    [...]
    'TOTALHOSTSERVICESOK': 'get_total_services_ok',
}
The macro resolver function basically does:

for macrosearch in macros_in_command_line:
    for obj in the_context:
        for macro in obj.__class__.macros:
            if macro == macrosearch:
                prop = obj.__class__.macros[macro]
OK, here we have found the object that has the macro, and the property of
that object that holds the information. For a simple property like
$HOSTADDRESS$, the value will be:
                value = getattr(obj, prop)
getattr is a built-in Python function you use when you want an attribute
of an object but do not know which one at coding time, only at run
time:
getattr(hst, 'address') is the same as hst.address

For complex macros like TOTALHOSTSERVICESOK, which is not a simple static
property, the macro resolver checks whether the attribute is "callable"
(is a function). If so, it just calls it, and the value is the return
value of the function. Here hst.get_total_services_ok() just returns the
number of OK services the host has. So it is:
                value = getattr(obj, prop)()
It is not totally trivial; it uses truly advanced Python features.
But now, how do you add a new macro for an object? You just add an
entry in the macros dict of the class. And that's all! You do not have
to modify the macro resolver code. Your macro is defined in just one
place, with no duplication.
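
Put together, a self-contained sketch of the idea could look like this
(the class, macros and helper names are simplified for the example, not
the real Shinken code):

class Host(object):
    # maps macro name -> attribute (or method) that holds its value
    macros = {
        'HOSTADDRESS': 'address',
        'TOTALHOSTSERVICESOK': 'get_total_services_ok',
    }

    def __init__(self, address, services_ok):
        self.address = address
        self._services_ok = services_ok

    def get_total_services_ok(self):
        return self._services_ok

def resolve_macro(macro_name, the_context):
    # find the first object in the context whose class knows this macro
    for obj in the_context:
        prop = obj.__class__.macros.get(macro_name)
        if prop is not None:
            value = getattr(obj, prop)
            return value() if callable(value) else value
    return None

hst = Host('192.168.0.1', 3)
print(resolve_macro('HOSTADDRESS', [hst]))          # 192.168.0.1
print(resolve_macro('TOTALHOSTSERVICESOK', [hst]))  # 3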

The macro resolver can be called by a host check (no service), a service
check, or a notification. It just depends on the command line you want
to resolve and on the context.
This is what I call dynamic: you describe an object, and all operations
are made at runtime using this description, not hardcoded loops. And the
code I wrote is nearly plain Python: simple, no { or ;  ;)



The same logic is used for object creation: the class has a dict named
properties. It describes the properties an object can have. For hosts
we have, for example:
properties = {
    [..]
    'retry_interval': {'required': False, 'default': '0',
                       'pythonize': to_int, 'status_broker_name': None},
    [...]
}
Here the retry_interval property is declared as:
*not required: if the property is not defined in the configuration, it is
not a problem, the 'default' value will be taken.
*default: the default used if the property is not specified (and not
required).
*pythonize: how to transform the "string" read from the configuration
file into a real Python object; here to_int is a function that takes a
string and turns it into an int.
*status_broker_name: send to the broker under a different name if not None.

In the code that reads the configuration, turns it into objects, handles
all the inheritance and fills in default values, there is no mention of
retry_interval. All that code is loops over the properties. The
properties are described in this dict, and only here. You want to add a
property for hosts? Just add a line in this dict. That's all. All the
configuration parsing and inheritance is done "dynamically".



An other "dynamic magic" is used for modules : you just do not care
about the object you managed. It's the duck typing (if it quacks like
a duck, it must be a duck). There is no limitation of compilation like
fixed structure. A module can called the address property of an
object. It just don't care about all 30 others properties. Why a
module cannot be load if it just use the address because you add a new
property in the object? The module just don't care. It's a linking
problem (structure change). Why dynamic programming language no
problem : it just don't care about it. The module want to load an you
change the structure? No problem. If a property was removed, the
module will have an exception that it can catch if it want (I use it
in the broker code : if a module raised an exception, I just deload
the module).
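
A rough sketch of that broker behaviour (the module and broker classes
here are invented for the example, not the real Shinken ones):

class PingModule(object):
    # only cares about .address, whatever the object really is (duck typing)
    def manage(self, obj):
        print('pinging %s' % obj.address)

class Broker(object):
    def __init__(self, modules):
        self.modules = list(modules)

    def push(self, obj):
        for mod in list(self.modules):
            try:
                mod.manage(obj)
            except Exception:
                # the module blew up (e.g. a property it relied on was
                # removed): unload it instead of crashing the daemon
                self.modules.remove(mod)

class Host(object):
    pass

hst = Host()
hst.address = '192.168.0.1'
Broker([PingModule()]).push(hst)   # works even with 30 other properties the module ignores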

Remember what we did for the parent hosts patch: I needed a new temporary
property to tag whether the host had already been checked. To keep
module loading working, we put this flag in the high bits of a bool (in
fact an int32). In Python, you just add it with:
hst.dfs_loop_check = 'OK'
And you remove it with:
del hst.dfs_loop_check

In this dynamic view, an object is seen like a dict: you can add or
remove properties at runtime. In fact, the properties of an object ARE
in a dict (object.__dict__) :). If you add introspection features on top,
you can really go from "hard code", in some way, to something close to
"data driven" development (you describe your objects and that's all).
All of these features come on top of classic object-oriented programming
(all objects share a lot of common code, like inheritance handling or
default filling; even hosts and services have a lot of code in common),
which just makes the code smaller.
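
For example (with a hypothetical, stripped-down Host class):

class Host(object):
    def __init__(self, address):
        self.address = address

hst = Host('192.168.0.1')
print(hst.__dict__)          # {'address': '192.168.0.1'}

hst.dfs_loop_check = 'OK'    # add a temporary property at runtime
print(hst.__dict__)          # {'address': '192.168.0.1', 'dfs_loop_check': 'OK'}

del hst.dfs_loop_check       # and remove it again
print(hst.__dict__)          # back to {'address': '192.168.0.1'}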

This is not perfect: no compilation means no check on property names (you
can try to add 60 to every retri_interval... no, it is retry_interval;
Python will happily create retri_interval on all hosts :( ), but there
are ways to avoid that.
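
One such way (not necessarily what Shinken does today) is __slots__,
which makes Python refuse attributes that were not declared, at the
price of losing the free-form __dict__ shown above:

class Host(object):
    __slots__ = ('address', 'retry_interval')

    def __init__(self, address, retry_interval):
        self.address = address
        self.retry_interval = retry_interval

hst = Host('192.168.0.1', 1)
hst.retry_interval += 60       # fine
try:
    hst.retri_interval = 60    # typo: raises AttributeError instead of silently creating it
except AttributeError as err:
    print(err)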

You do not have to code like this in Python; in fact you can code just
like you already do in C. But once you begin to use these features,
believe me, you will never go back :)

As an example of how easy this makes things: yesterday I added a new
property for hosts and services, hot_period (a service can be tagged as
"low priority": critical -> warning, but during its "hot_period" it stays
critical, like the end of the month for a financial service ;) ). It took
me 2 minutes to add! (1 min to launch emacs, 1 min to code... ;) ).

>
>> I propose 2 things:
>> *we list the ideas that are in Shinken but absent from Nagios (process
>> pool, return in socket/memory, new service options like
>> inverse_ok_critical or critical_is_warning) or from Merlin (the
>> "automatic cutting"/dispatching function) and we look at how to put them
>> into the current code, for Nagios v4 (next year? :) ) and Merlin v1.
>
> "process pool" and "return in memory" are not features. They're
> implementation details. What we need to do is to decide on a few
> problems in Nagios and work on them.
Yes.

>
> One such problem is the rather monolithic functions that have far
> too many side-effects, without any clear API's that modules can use
> to safely modify objects while Nagios is running. Refactoring that
> into manageable (and testable) pieces would be a worthwhile goal in
> and of itself.
>
>> *we open a "lab" or "long-term-dev" branch where we test things
>> without fear of breaking the current modules. With such a branch,
>> everyone can begin to test and hack the code, see how it works, and
>> slowly redo everything that is done in the current code. It will attract
>> new developers who are afraid of C (yes, there are some :) ) so it
>> will not divide efforts on the main code. If this branch is a
>> success, we can bring ideas from it into the main code, and try to make
>> a mix of these branches like you propose just above.
>>
>
> I'd actually prefer if new features are created on their own topic-
> branches so that each individual topic can be merged on its own rather
> than as a mass of co-dependant topics. Ofcourse, some topics will be
> co-dependant no matter what we try. In particular those who rely on
> API's introduced in some other topic, ofcourse.
Yes, we can change the current implementation with small topics. But we
must also talk about the long-term future. I still think we must
reorganise the Nagios code to be modular. It will break module
compatibility (or at least require a recompilation). But why not try to
see whether languages other than C can be used? I am not saying we must
use Python, but C is not the best language for a scheduler. Scheduling is
a high-level problem; let's use every tool we can to solve it.


>
>> With this solution the community will not be divided in two: we will
>> have a "pool of ideas" branch and, if it stabilizes in the long term,
>> maybe a good mix of the two worlds, and it gives everyone time to peek
>> into it, see how it works and whether it can be used in some situations
>> (like on Windows for small environments) for testing.
>>
>> The main difficulty will be to keep the lab not too far from the main
>> branch, but with a common git it should be easier than a fork or
>> something like that.
>>
>
> Yes, probably. Although I'm still sceptical about implementing parts
> of it in Python.
Even after I have shown you how useful dynamic programming can be? :)

>
>>>
>>> Would it for example be possible to use Shinken as the checking
>>> engine that supplies check-results back to a C-based scheduler
>>> that retains config parsing and module compatibility? If that's
>>> the case, we might be on to something. Otherwise, we'd better get
>>> busy re-writing parts of the Nagios core to implement a process
>>> pool.
>> The "orders" for pollers are send with Pyro, a full python module. I
>> know we can load C code into Python, but it must be possible to load
>> Python into C. But this part of Shinken is not the more important. For
>> C Pool, we can watch for DNX (it's threads but if we remove XML from
>> it, it can be fast, isn't it? ).
>>
>
> It's definitely possible to load Python into C. That's what the Python
> interpreter does, after all.
>
> Imitating DNX is one plan ofcourse. Or we simply introduce a short
> binary protocol for the checking daemons to report their check-results
> back to Nagios. It's immensely simple and super-efficient. Especially
> since plugins only report one chunk of data as its output so only one
> pointer has to be recalculated with some really simple arithmetic.
That can be quite easy :)
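
For what it's worth, a rough Python sketch of packing/unpacking such a
check result (the field layout here is invented for the example, it is
not a proposed wire format):

import struct

# hypothetical layout: return code, start/end time (sec + usec), then the
# single chunk of plugin output (bytes) appended after the fixed header
HEADER = '!iIIII'

def pack_result(return_code, start, end, output):
    start_s, start_us = int(start), int((start % 1) * 1e6)
    end_s, end_us = int(end), int((end % 1) * 1e6)
    return struct.pack(HEADER, return_code, start_s, start_us,
                       end_s, end_us) + output

def unpack_result(data):
    header_size = struct.calcsize(HEADER)
    return_code, start_s, start_us, end_s, end_us = \
        struct.unpack(HEADER, data[:header_size])
    output = data[header_size:]
    return (return_code, start_s + start_us / 1e6,
            end_s + end_us / 1e6, output)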


Jean

>
> --
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>
> Considering the successes of the wars on alcohol, poverty, drugs and
> terror, I think we should give some serious thought to declaring war
> on peace.
>
