nagios writing escalation rules multiple times to objects.cache

Chris Baldwin oogs at umich.edu
Mon Oct 1 21:37:04 CEST 2012


Short version:

I have an ever-growing Nagios install for monitoring a bunch of linux 
hosts (currently 99 hosts & 2322 services, I plan on adding 115 more 
hosts & 1500+ services). I've noticed something odd with my escalation 
rules - they're being repeated multiple times in my objects.cache file. 
This has started to affect performance for parts of my Nagios install, to 
the point where it's painfully slow to use the web interface.

My google-fu is weak today, so I was hoping someone here could point me 
in the right direction.

Longer version:

I have 4 escalation rules:
-Our helpdesk gets notification #1 for critical issues.
-Our on-call person gets notifications 1 -> 12 @ 5 minute intervals 24x7.
-The relevant IT-group leader(s) get notifications 5->12 @ 5 minute 
intervals during on call periods.
-Our CIO gets notification 12 -> infinity at 60 minute intervals during 
on call periods.
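
To make the overlap between the four tiers explicit, here is a minimal sketch of which groups would receive a given notification number. The group names other than the_boss are hypothetical, escalation periods (24x7 vs. on-call hours) are ignored, and a last value of 0 is taken to mean "no upper bound", as in the Nagios last_notification directive.

```python
# Hypothetical model of the four escalation tiers described above.
# Group names other than "the_boss" are invented for illustration.
# Each entry is (contact group, first_notification, last_notification);
# last == 0 is treated as "no upper bound".
TIERS = [
    ("helpdesk", 1, 1),
    ("oncall", 1, 12),
    ("it_group_leaders", 5, 12),
    ("the_boss", 12, 0),
]


def groups_for(notification_number):
    """Return the contact groups whose range covers this notification."""
    matched = []
    for group, first, last in TIERS:
        in_range = notification_number >= first and (
            last == 0 or notification_number <= last
        )
        if in_range:
            matched.append(group)
    return matched


if __name__ == "__main__":
    for n in (1, 5, 12, 13):
        print(n, groups_for(n))
```

Under these assumptions, notification 12 is the only one that reaches the on-call person, the group leaders, and the CIO at the same time.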

We use puppet to control our environment, and it's amazing for deploying 
servers and adding them to nagios. Once I'm able to bring in other 
aspects of our environment under puppet control (firewall, sudo, yum 
repos), it will be trivial to set up a server from scratch and monitor it.

In order to create a new set of escalation rules, we use a custom class 
on the puppet server and a small bit of code to be executed from the 
client-side (of puppet) to make this work. An example:

         # Escalate to the_boss. He, in turn, will call people. I imagine
         # this to be along the lines of Hulk nudging Thor playfully in The
         # Avengers. And sending him flying through a few bulkheads.
         nagios::server::escalations { "Boss-critical":
                 contact_groups          => "the_boss",
                 escalation_options      => "c,r",
                 escalation_period       => "oncall_hours",
                 first_notification      => "12",
                 last_notification       => "0",
                 notification_interval   => "60",
                 servicegroup_name       => "Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie",
         }

I know this portion works correctly - it's producing my desired result, 
which is 1 file per (set) of escalation rules specified. I have 1722 
escalation cfg files.

The cfg files look something like this:

         define serviceescalation{
             contact_groups          the_boss
             escalation_options      c,r
             escalation_period       oncall_hours
             first_notification      12
             host_name               my.hostname.xyz
             last_notification       0
             notification_interval   60
             #service_description    Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
             servicegroup_name       Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
         }

The rules themselves live in the following directory structure: 
/etc/nagios/escalations/$hostname/$rulename.cfg , and nagios.cfg has an 
entry to read /etc/nagios/escalations/ as a whole.
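
One way to sanity-check that layout is to count the .cfg files under each host directory. The sketch below assumes the $hostname/$rulename.cfg structure described above; the function name cfg_files_per_host is made up for illustration.

```python
from pathlib import Path


def cfg_files_per_host(root):
    """Map each host directory under root to the number of .cfg files in it."""
    root = Path(root)
    return {
        host_dir.name: sum(1 for _ in host_dir.glob("*.cfg"))
        for host_dir in sorted(root.iterdir())
        if host_dir.is_dir()
    }


if __name__ == "__main__":
    # Path taken from the layout described above; adjust as needed.
    counts = cfg_files_per_host("/etc/nagios/escalations")
    print(sum(counts.values()), "cfg files across", len(counts), "hosts")
```

A host whose count differs sharply from the others would be a reasonable place to start looking.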

The rules are written to objects.cache as:

         define serviceescalation {
             host_name       my.hostname.xyz
             service_description     Zombie
             first_notification      12
             last_notification       0
             notification_interval   60.000000
             escalation_period       oncall_hours
             escalation_options      c,r
             contacts        jabberbot-con
             contact_groups  the_boss
         }
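
Since objects.cache uses this simple block format, repeated rules can be counted with a short script. This is a rough sketch that assumes every block looks like the one above (one directive per line, a closing brace on its own line); parse_escalations and count_duplicates are illustrative names, not part of Nagios.

```python
from collections import Counter


def parse_escalations(text):
    """Yield each 'define serviceescalation' block as a dict of directive -> value."""
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("define serviceescalation"):
            block = {}
        elif line == "}":
            if block is not None:
                yield block
            block = None
        elif block is not None and line:
            parts = line.split(None, 1)
            block[parts[0]] = parts[1] if len(parts) > 1 else ""


def count_duplicates(text):
    """Return {block-as-sorted-tuple: count} for blocks seen more than once."""
    counts = Counter(tuple(sorted(b.items())) for b in parse_escalations(text))
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Path is the object_cache_file value from nagios.cfg.
    with open("/var/log/nagios/objects.cache") as handle:
        duplicates = count_duplicates(handle.read())
    for key, n in sorted(duplicates.items(), key=lambda item: -item[1])[:10]:
        print(n, dict(key).get("host_name"), dict(key).get("service_description"))
```

Note that the script reads the whole file into memory, which is workable for a file of a few hundred megabytes but not elegant.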

In case you're wondering, the reason we don't wildcard stuff is so we 
can control it on a per-host basis. It could be that host uvw doesn't 
require us to monitor MySQL processes, as MySQL isn't installed there. 
Having an escalation for a non-existing service would mean the nagios 
config check fails, etc.

Now, when I look at my objects.cache file, I see this:
Rule #1
Rule #2
Rule #3
Rule #4
(repeat 98 more times)

I see the same if I look at a different host - that is, 99 copies of a 
rule that is particular to that host. Instead of having 9288 escalation 
rules, I have over 900000 (900 thousand).
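
The arithmetic behind those two figures, using only the numbers quoted in this message and assuming each rule really is repeated exactly once per host:

```python
hosts = 99
services = 2322
rules_per_service = 4

# One copy of each escalation rule per service.
expected = services * rules_per_service

# Every rule repeated once per host, as observed in objects.cache.
repeated = expected * hosts

print(expected)   # 9288
print(repeated)   # 919512
```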

I looked at my test nagios install (which has a smaller pool of hosts, 
completely unrelated to my live environment), and it exhibits the same 
issue. The pool is just small enough that the size of objects.cache 
didn't matter.

My questions to you guys:
- Am I crazy to think that it's reading every rule once for *each* 
server? I thought it was a coincidence, but it's happening in my test 
setup as well, which is in a completely separate VDC.
- Have you seen this before? If so, how did you fix it?
- What else should I look at?

I'm stumped. I can't find anything tell-tale in logs, strace produces a 
mountain of gibberish, and I haven't turned up anything online.

-Chris B.

Some more info, as I'm sure you'll ask for this:

I tried using the precache, but it didn't help. Both files were created by 
my nagios install.
#ls -la | grep objects
-rw-r--r--   1 nagios nagios 251616779 Oct  1 14:17 objects.cache
-rw-r--r--   1 nagios nagios 251616779 Oct  1 14:16 objects.precache
(that's about 251 MB each)

# nagios -v
Nagios Core 3.3.1
# yum list nagios
Installed Packages
nagios.x86_64 3.3.1-3.el6                            @epel

# uname -a
Linux nagios.hostname.xyz 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
(CentOS 6.3)

And lastly, my config. I'll be the first to admit it needs some more 
tweaking, however it's working reasonably well now.
# more nagios.cfg
##############################################################################
#
# NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
#
# Read the documentation for more information on this configuration
# file.  I've provided some comments here, but things may not be so
# clear without further explanation.
#
# Last Modified: 12-14-2008
#
##############################################################################
log_file=/var/log/nagios/nagios.log
# You can specify individual object config files as shown below:
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_file=/etc/nagios/objects/templates.cfg
cfg_file=/etc/nagios/objects/hostgroups.cfg
cfg_file=/etc/nagios/objects/oncall.cfg

# You can also tell Nagios to process all config files (with a .cfg
# extension) in a particular directory by using the cfg_dir
# directive as shown below:
cfg_dir=/etc/nagios/escalations
cfg_dir=/etc/nagios/servers
cfg_dir=/etc/nagios/services
cfg_dir=/etc/nagios/hostgroups
cfg_dir=/etc/nagios/servicegroups

object_cache_file=/var/log/nagios/objects.cache
precached_object_file=/var/log/nagios/objects.precache
resource_file=/etc/nagios/private/resource.cfg
status_file=/var/log/nagios/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
#command_check_interval=15s
command_check_interval=-1
command_file=/var/spool/nagios/cmd/nagios.cmd
external_command_buffer_slots=4096
lock_file=/var/run/nagios.pid
temp_file=/var/log/nagios/nagios.tmp
temp_path=/tmp
event_broker_options=-1
broker_module=/usr/lib64/nagios/brokers/npcdmod.o config_file=/etc/pnp4nagios/npcd.cfg
log_rotation_method=d
log_archive_path=/var/log/nagios/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
global_service_event_handler=remove_service_ack
service_inter_check_delay_method=0.01
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=0.02
max_host_check_spread=30
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/var/log/nagios/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
#time_change_threshold=900
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
#host_perfdata_command=process-host-perfdata
#service_perfdata_command=process-service-perfdata
#host_perfdata_file=/tmp/host-perfdata
#service_perfdata_file=/tmp/service-perfdata
#host_perfdata_file_template=[HOSTPERFDATA]\t$TIMET$\t$HOSTNAME$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$
#service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$
#host_perfdata_file_mode=a
#service_perfdata_file_mode=a
#host_perfdata_file_processing_interval=0
#service_perfdata_file_processing_interval=0
#host_perfdata_file_processing_command=process-host-perfdata-file
#service_perfdata_file_processing_command=process-service-perfdata-file
obsess_over_services=0
#ocsp_command=somecommand
obsess_over_hosts=0
#ochp_command=somecommand
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
#use_timezone=US/Mountain
#use_timezone=Australia/Brisbane
p1_file=/usr/sbin/p1.pl
enable_embedded_perl=1
use_embedded_perl_implicitly=1
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=1
enable_environment_macros=1
#free_child_process_memory=1
#child_processes_fork_twice=1
debug_level=0
debug_verbosity=1
debug_file=/var/log/nagios/nagios.debug
max_debug_file_size=1000000
