Nagios 3.0 SLA Reporting

Mohr James james.mohr at elaxy.com
Mon May 26 17:03:40 CEST 2008


Hi All!

We are in the process of moving from Nagios 2.5 to Nagios 3.0 with
MySQL. We monitor and report services for several customers and thus
have a number of SLAs to consider. Currently we have a self-written
reporting mechanism, but the developer is no longer with the company and
the documentation is lacking in many areas. Since we are using the
Nagios NDO, we would prefer not to try to force the old mechanism to
work with 3.0. So, we need a new reporting mechanism. 

I looked at a couple of tools, but found nothing that seems close to
finished, and none that addresses adding downtimes after the fact.

We cannot simply define a check_period or notification_period and
rely on that, because we need to monitor 24x7 and more or less prove we
are monitoring even if there is scheduled maintenance. Also, there are
cases where the service is down and it is not our fault, and per the SLA
we do not subtract that time from availability. Therefore, we need a
mechanism to add downtimes after the fact, which then prevents the
reporting mechanisms from counting that time.
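
To make the requirement concrete, here is a minimal Python sketch of an
availability calculation that honours downtimes entered after the fact.
It is not taken from any existing tool. The function names are
hypothetical, and it assumes that excluded time is removed from the
measurement period entirely (an SLA could instead count excluded time as
"up", which would change the denominator).

```python
from datetime import datetime


def merge(intervals):
    """Merge overlapping (start, end) intervals so no second is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip(intervals, window):
    """Restrict intervals to the reporting window and merge the result."""
    w_start, w_end = window
    clipped = []
    for start, end in intervals:
        start, end = max(start, w_start), min(end, w_end)
        if start < end:
            clipped.append((start, end))
    return merge(clipped)


def intersect(a, b):
    """Return the parts of interval list a that are also covered by b."""
    common = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            lo, hi = max(a_start, b_start), min(a_end, b_end)
            if lo < hi:
                common.append((lo, hi))
    return merge(common)


def total_seconds(intervals):
    return sum((end - start).total_seconds() for start, end in intervals)


def sla_availability(period, outages, exclusions):
    """Percent availability over period, ignoring excluded time.

    Assumption: excluded time is dropped from both the outage total and
    the measured period. Returns None if the whole period is excluded.
    """
    outages = clip(outages, period)
    exclusions = clip(exclusions, period)
    measured = total_seconds([period]) - total_seconds(exclusions)
    if measured <= 0:
        return None
    counted_down = total_seconds(outages) - total_seconds(
        intersect(outages, exclusions)
    )
    return 100.0 * (measured - counted_down) / measured


if __name__ == "__main__":
    day = (datetime(2008, 5, 1), datetime(2008, 5, 2))
    outages = [(datetime(2008, 5, 1, 10), datetime(2008, 5, 1, 12))]
    # Exclusion recorded later, e.g. an outage that was the customer's fault.
    exclusions = [(datetime(2008, 5, 1, 11), datetime(2008, 5, 1, 13))]
    print(sla_availability(day, outages, exclusions))
```

In this example one of the two outage hours falls inside the exclusion,
so 21 of the 22 measured hours count as available (about 95.45%).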

NagiosSLA seems promising and I downloaded it from SourceForge. However,
I do not find any mechanism to manage the SLA periods other than simply
saying to report everything within the check_period. Since we are
using NDO, creating an extra EventHandler seems like a waste, and the
report_script.pl seems to depend on the DB tables filled by the event
handler. Looking at the script, I do not see much of a problem changing
the table and column names. However, as far as I can tell, the
sla_exclusion table is never really used. The exclusions are read into an
array ( my @exclusion = retrieveData("sla_exclusion"); ), but @exclusion
is never used after that. This means that every outage is reported.

Since we already have the data in MySQL, I thought about simply using
the nagios_scheduleddowntime tables. However, I see a problem with
outages in the past. As far as I can tell, if you schedule a downtime
in the past, it is silently ignored. Also, from what I see, the table is
cleared when the outage is over. Both of these are logical to some
extent, and I think my C is good enough to modify the code to either add
all outages and not delete them or, perhaps more straightforwardly,
simply write to a completely different table and avoid changing too much
existing code.
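
As a rough illustration of the "completely different table" idea, the
following Python sketch keeps a permanent history of downtimes that
accepts entries dated in the past and never deletes them. Everything in
it is an assumption made for illustration: the table name
sla_downtime_history and its columns are hypothetical and not part of
the NDO schema, and sqlite3 stands in for MySQL only so the sketch runs
on its own.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) history table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sla_downtime_history (
            id                  INTEGER PRIMARY KEY,
            host_name           TEXT    NOT NULL,
            service_description TEXT    NOT NULL DEFAULT '',
            start_time          INTEGER NOT NULL,  -- Unix epoch seconds
            end_time            INTEGER NOT NULL,  -- Unix epoch seconds
            reason              TEXT    NOT NULL DEFAULT '',
            CHECK (end_time > start_time)
        )
        """
    )
    return conn


def add_downtime(conn, host, service, start, end, reason=""):
    """Record a downtime; start and end may lie entirely in the past."""
    conn.execute(
        "INSERT INTO sla_downtime_history "
        "(host_name, service_description, start_time, end_time, reason) "
        "VALUES (?, ?, ?, ?, ?)",
        (host, service, start, end, reason),
    )
    conn.commit()


def downtimes_for(conn, host, service, period_start, period_end):
    """Return (start, end, reason) rows overlapping the reporting period."""
    return conn.execute(
        "SELECT start_time, end_time, reason FROM sla_downtime_history "
        "WHERE host_name = ? AND service_description = ? "
        "AND start_time < ? AND end_time > ? "
        "ORDER BY start_time",
        (host, service, period_end, period_start),
    ).fetchall()


if __name__ == "__main__":
    conn = open_store()
    # Entered after the fact: both timestamps are long past.
    add_downtime(conn, "web01", "HTTP", 1209600000, 1209603600, "customer outage")
    print(downtimes_for(conn, "web01", "HTTP", 1209500000, 1209700000))
```

A reporting script could read its exclusions from such a table instead
of from the live downtime table, which would leave the existing Nagios
and NDO code untouched.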

So, the first question is whether there are any tools available to do
SLA Reporting properly, FOSS or commercial. If not, does anyone have any
suggestions about making changes to the existing code as I suggested?

I would be grateful for any input.

Regards,

Jim Mohr
