[Nagios-users] [PATCH] reduce notification load; fix $NOTIFICATIONRECIPIENTS$ macro #98

Andreas Ericsson ae at op5.se
Tue Nov 22 10:31:39 CET 2011


On 11/22/2011 01:02 AM, Michael Friedrich wrote:
> On 21.11.2011 20:56, Andreas Ericsson wrote:
>> On 11/01/2011 02:05 PM, Michael Friedrich wrote:
>>> hi,
>>>
>>> recently we've been debugging on team icinga in the middle of
>>> notifications and macros, and while investigating on a users problem,
>>> we've digged a bit deeper into the notification viability checks,
>>> resulting in deeper analysis of an Opsview patch to reduce the
>>> notification load significantly by moving the viability checks from
>>> the actual notification into the creation of the contacts notified,
>>> passing only a list of 'qualified' contacts to the actual
>>> notification logic. the only thing to remark over here is that the
>>> checks against the valid notification_period now happen sooner, and
>>> not actually when the notification is sent to each contact.
>>>
>>> while implementing that patch into current code (needs some macro
>>> passing with current code), we did remember nagios bug #98 where the
>>> $NOTIFICATIONRECEIPIENTS$ macro is demanded to be only populated with
>>> the actual contacts to be notified, but not all of those assigned to
>>> the host/service. while this is considered to be a real bug, further
>>> investigation showed that thanks to the viability checks before
>>> calling add_notification(), contacts won't be added to that macro as
>>> the macro logic happens within that function too.
>>>
>>> so by applying the attached git patch, you will a. reduce
>>> notification load and b. fix the $NOTIFICATIONRECEIPIENTS$ macro
>>> holding all contacts, but not the viable contacts.
>>>
>>> since the code remains actually the same on icinga and nagios in this
>>> stage, the tests can be found at the icinga dev tracker as usual.
>>> https://dev.icinga.org/issues/1744
>>> https://dev.icinga.org/issues/2023
>>>
>>
>> I've started looking into this patch right now. It's good to get that
>> issue (#98) fixed, but I fail to see any noticeable performance
>> improvement. All contacts potentionally viable for being contacted are
>> still looked at, but the difference with this patch is that it checks
>> the viability before shipping it off to add_notification(), which does
>> fix issue 98 but at the expense of quite a lot of code duplication.
> 
> normally, all contacts would have been added to the notification_list in
> memory, even those not actually passing the viability checks. but at
> this stage of the code, nobody is aware of that so the list gets
> populated either way by calling add_notification().
> 
> /* add all individual contacts for this host */
>          ^^^
> 
> having that notification_list created, this remains fully linked in
> memory. let's say, you have a bunch of some 1k contacts for that
> service, and actually the alarm would hit only those in the nonworkhours
> or workhours timeperiod and only on critical, for the ops team e.g.
> so by looping through the notification_list, you will encounter *all*
> contacts for that host, only the duplicates have been removed.
> 
> /* notify each contact (duplicates have been removed) */
> 
> then you'll fire up the actual notification with calling
> notify_contact_of_host - and actually in there, the current core checks
> the viability for the contact to be notified.
> 
> you are right, if each contact gets notified 24x7 on all
> notification_options, the algorithm stays the same. but if you happen to
> have a lot of different contacts assigned to hosts and services, not
> getting notified each time a notification is triggered, the overall
> amount of looping through notification_list will be shorter and save
> some cpu cycles, and probably on larger systems, a bit more than just
> some as this means a reduction of the looping for each contact to be
> checked to be notified on the actual end-of-the-line.
> 

Right, but all contacts are still checked for viability, so the amount
of looping is reduced once for all those who aren't viable, while the
number of viability checks (which I presume is the expensive part) will
remain the same.

> furthermore, where do you get the idea of code duplication from? the
> only changes made by this patch is actually moving the viability checks
> and therefore passing an additional function parameter which makes the
> diff a bit more bloated than it should be.
> 

The fact that the patch introduces eight locations with identical code
headed with "check now if contact can be notified".

The proper way to do this would be to introduce create_recipient_list(),
passing it all the variables it needs to produce a list of recipients
that have duplicates removed *and* are viable for being contacted. If a
lot of code still has to be duplicated (as in the patch), more helpers
in the form of add_recipient_for_service(&mac, srv, cntct) would be
nifty so the viability check can be moved there without breaking the
abi for create_notification_list_from_{host,service}().

I'm in the middle of a release at $dayjob so I've had to postpone that
until next week or so. I wouldn't mind if you beat me to it ;)

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure 
contains a definitive record of customers, application performance, 
security threats, fraudulent activity, and more. Splunk takes this 
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d




More information about the Developers mailing list