[naemon-dev] Ideas about future features

Robin Sonefors ozamosi at flukkost.nu
Sat Dec 28 00:22:46 CET 2013


On 2013-12-27 02:06, Matthias Eble wrote:
> 1) have a feature to monitor per metric rather than per check_command.
>     * Today, many plugins check lots of things.
>     * Typical example is check_disk, check_snmp.
>     * Depending on the configuration method, acknowledging a problem
> with /mytmpmount also disables notifications for /var
>       * To fix that, we'd need to create a stricter plugin output
> standard that contains per-metric status codes.
>          * metrics would be /mytmpmount-freespace, TCP-response-time,
> http-status-code or http-match-string
>       * The core would need to create sub-services at run-time and
> populate their results.
>       * Benefit: per metric actions and logging. Especially per metric
> downtime and acknowledgements
>       * Maybe it could also be used for receiving snmp traps or log
> pattern matching checks
>          * different alerting for different patterns/traps
>
>     * Today, many folks wrap the plugin call and submit results to a
> passive check.
>        * works, but all possible services need to be in the config.
>        * That's where folks start generating nagios configs and reload
> the daemon.
>        * Is that what we want? Maybe?
>           * problems arise when there are syntax problems
>
>     * raw proposal:
>    define service {
>        ...
>        check_command  check_disk
>        contact_group  os_admins
>        define metric {
>            metric_name    ^/oracle.*
>            contact_group  oracle_admins
>        }
>    }
>
>     * maybe another layer could be added for check_multi-like plugins.
>        * but they could also be forced to structure metric names

I've been thinking about plugins and plugin architecture a bit.

The nagiosplugins project is talking about a new threshold format - 
https://www.nagios-plugins.org/doc/new-threshold-syntax.html - to 
achieve the same thing you want to solve in-core.

I think the nagiosplugins approach - basically, update all plugins to 
support a much more complex (though easier to understand) threshold 
format, because the old one was too complicated - is wrong. Programmers 
write buggy code, and telling programmers to write more code leads to 
more bugs (or at least I write buggy code, and I'm too stupid to write 
plugins already - that's why I stick to the core :P )

But I'm also not sure how far into the core I'd want to put it. What 
if, instead of changing either the core or the plugins, we write a plugin 
wrapper that takes a threshold as described by nagiosplugins and a 
plugin command line? It would simply parse the perfdata from the plugin 
and the threshold from the CLI, throw away the plugin exit code, and send 
a new, "improved" exit code and stdout to naemon.

I feel this plugin wrapper approach would take the least amount of work 
to implement. Which problems would it leave unsolved?
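
To make the wrapper idea concrete, here is a rough sketch in Python. It is 
only an illustration under assumptions: the threshold form 
`metric=NAME,warn=LO..HI,crit=LO..HI` is a simplified stand-in loosely 
modeled on the nagiosplugins proposal, not the full syntax, and the function 
names (parse_perfdata, parse_threshold, rewrap) are made up for this example. 
It reads the standard `label=value[UOM];warn;crit;min;max` perfdata after 
the `|`, ignores the plugin's own exit code, and derives a new one.

```python
import re
import shlex
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_perfdata(output):
    """Return {label: value} from the perfdata after the first '|'."""
    _, _, perf = output.partition("|")
    metrics = {}
    for token in shlex.split(perf):  # shlex handles 'quoted labels'
        label, sep, rest = token.rpartition("=")
        match = NUMBER_RE.match(rest)  # leading number, unit is dropped
        if sep and label and match:
            metrics[label] = float(match.group())
    return metrics


def parse_threshold(spec):
    """Parse the simplified (assumed) form metric=NAME,warn=LO..HI,crit=LO..HI."""
    fields = dict(part.split("=", 1) for part in spec.split(","))

    def parse_range(text):
        lo, _, hi = text.partition("..")
        return float(lo), float(hi)  # float() accepts "inf" and "-inf"

    ranges = {level: parse_range(fields[level])
              for level in ("warn", "crit") if level in fields}
    return fields["metric"], ranges


def rewrap(threshold_spec, plugin_output):
    """Return (exit_code, stdout) computed from perfdata alone."""
    metric, ranges = parse_threshold(threshold_spec)
    metrics = parse_perfdata(plugin_output)
    if metric not in metrics:
        return UNKNOWN, f"UNKNOWN - metric {metric!r} not in perfdata"
    value, code = metrics[metric], OK
    for level, level_code in (("warn", WARNING), ("crit", CRITICAL)):
        lo, hi = ranges.get(level, (1.0, 0.0))  # empty range if unset
        if lo <= value <= hi:
            code = level_code  # crit is checked last, so it wins
    name = ("OK", "WARNING", "CRITICAL")[code]
    _, _, perf = plugin_output.partition("|")
    return code, f"{name} - {metric}={value:g} | {perf.strip()}"


def main(argv):
    # usage: wrapper.py THRESHOLD PLUGIN [PLUGIN ARGS...]
    result = subprocess.run(argv[2:], capture_output=True, text=True)
    code, text = rewrap(argv[1], result.stdout)  # plugin exit code is ignored
    print(text)
    return code


if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

Something like `wrapper.py metric=/boot,warn=60..80,crit=80..inf check_disk ...` 
would then alert on /boot only. Note that this still yields one service per 
wrapper call, so per-metric acknowledgements would need one service 
definition per metric.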

> [snip]
> What do you think? What's the focus of the dev-team?

So far, it seems the focus is mostly on cleanup. There's just *so* 
*much* ancient *crap* lying around. Tens of thousands of lines of code 
to create an ugly, useless web UI, which forces me to ifdef every second 
line - what? Three (or so, I lost count) different configuration parsers 
for near-identical-yet-subtly-different configuration file formats - 
really? And then the amount of special casing for things that you might 
expect to behave similarly until you find out the hard way that they 
really don't - for instance, I always thought the flapping calculation 
was based on the last 20 (or so) check results, but nooo: 
https://github.com/naemon/naemon-core/blob/master/naemon/flapping.c#L116 
I think the current score is something like -120k lines compared to the 
initial code import, but there's a lot more we could do.

Oh, and testing. One of the scariest things when starting on a new 
codebase is realizing that there are no tests at all. The only thing 
worse than that is finding a directory full of tests - granted, all 
covered in cobweb and dust - and you think (or hope, or whatever you 
call that feeling when you know you must never assume good things but 
still want to) that you might have found The Book of Shadows in the 
attic, but after dusting the code off and flipping through it, it turns 
out that nobody has executed (or even compiled) any code here for 
*years*, and half the tests test features that don't even exist 
anymore. It dawns on you that somebody spent days - weeks, even - 
writing tests to avoid regressions - and then didn't run the tests and 
thus didn't catch the regressions. See: t-tap, where a few of the test 
files actually work, and none of them have a working build system ATM.

As far as I'm going to go in terms of longer-term vision and 
the-way-to-go-iness, I'd like to modularize the crap out of the core. 
The nagios "core" is anything but, as explained above. It would be neat 
to lift out a bunch of nagios functionality into a bundle of 
preinstalled modules. This would serve two purposes: it would force us 
to dogfood the broker API and thus help us improve it, and it would 
compartmentalize features (new, and old) to avoid weird interactions 
with other features.

The broker API as it exists is terrible - you're just given all of the 
naemon internals, spotty and inconsistent hooks, and a "good luck". This 
means that, as a core developer, any change I make at all is bound to 
break some module, while as a module author, I need to learn all of the 
core to write a module. And you want to store your own add-on 
configuration/data? Hah! So, in the end, it's just easier to become a 
core contributor, because who has the time not to?

What would happen if, to take an example that sounds weird but makes 
some kind of sense, the flapping functionality was a module? That would 
require some extra module functionality - modules would have to be able 
to add configuration statements to the config (global and per-object) 
for configuring flapping thresholds, and modules would have to be able 
to couple state (is_flapping, last 20 check results) with the object and 
have it persist between restarts. Now, what if this was the easiest, 
most concise, and easiest-to-find-out-how way to do it?

I think a module should be able to do all these things - and if it could 
do that, and if flapping was a module, I would not ever again have to 
worry about flapping in the remaining core, nor would I wonder where all 
special cases for flapping are handled - heck, I could even see if the 
flapping feature has tests and how extensive they are, just from looking 
at github.com/naemon/flapping ! Today, almost all features - including 
flapping - are handled by the pair of ogres known as 
handle_async_service_check_result/handle_async_host_check_result - 
looking at the code, I have no idea what it will actually end up doing 
for each case, but I'm quite sure a few of the code paths are buggy - 
because that many untested if conditions just aren't going to all be 
correct. Modularizing away the if statements (all of them, all over the 
core) should render a more consistent, less buggy monitoring solution.
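
As a rough illustration of the flapping-as-a-module idea, here is a sketch 
of what such a module might own. Everything in it is hypothetical: no such 
module API exists in naemon today, the class and method names are invented, 
Python is used only for brevity (real broker modules are C shared objects), 
and the calculation is the naive "state changes in the last 20 results" 
version, which, as noted above, is not what the core actually does. The 
point is only that the module declares its own configuration defaults, 
couples its state to an object, and can save and restore that state across 
restarts.

```python
import json
from collections import deque


class FlappingModule:
    """Hypothetical module owning its config keys and per-object state."""

    # Config statements the module would add (names and defaults are made up).
    config_defaults = {"low_flap_threshold": 20.0, "high_flap_threshold": 30.0}

    def __init__(self, config=None, saved_state=None):
        self.config = {**self.config_defaults, **(config or {})}
        # Per-object state keyed by object name, restored after a restart.
        self.state = {}
        for name, data in json.loads(saved_state or "{}").items():
            self.state[name] = {
                "is_flapping": data["is_flapping"],
                "history": deque(data["history"], maxlen=20),
            }

    def on_check_result(self, name, status):
        """Hypothetical hook: record a result, return the flapping flag."""
        obj = self.state.setdefault(
            name, {"is_flapping": False, "history": deque(maxlen=20)})
        obj["history"].append(status)
        history = list(obj["history"])
        # Naive measure: share of consecutive results that differ.
        changes = sum(a != b for a, b in zip(history, history[1:]))
        percent = 100.0 * changes / max(len(history) - 1, 1)
        if percent >= self.config["high_flap_threshold"]:
            obj["is_flapping"] = True
        elif percent < self.config["low_flap_threshold"]:
            obj["is_flapping"] = False
        return obj["is_flapping"]

    def dump_state(self):
        """Serialize per-object state so it can persist between restarts."""
        return json.dumps({
            name: {"is_flapping": obj["is_flapping"],
                   "history": list(obj["history"])}
            for name, obj in self.state.items()
        })
```

With something of this shape, the core would only call the hook and store 
the opaque state blob; all flapping-specific branching would live in the 
module and could be tested in isolation.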

tl;dr: naemon should allow contributors to write modules that are much 
more powerful than today's broker modules, to make it possible and easy 
to write a module to add seemingly built-in functionality, like metrics 
and exceptions - then, we could start to write such modules, go crazy, 
and see what comes out!

