distributed monitoring/central server performance problems

Jason Lancaster jason at skynetweb.com
Sat May 3 01:54:13 CEST 2003


A simple background of my environment:
My central server is receiving external commands from 3 monitoring servers.
I have just over 3200 services monitored, all delivered to the central
server through NSCA. Everything works perfectly until the Nagios process on
the central server attempts to parse the external commands.
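
For readers unfamiliar with the mechanism, here is a minimal sketch of what a
passive service check result looks like when it is written to the external
command file. The function names are hypothetical; the line format follows
the PROCESS_SERVICE_CHECK_RESULT entries shown in the log, and the default
path is the command_file value from the nagios.cfg in this message. In the
setup described here NSCA does the writing, so this is only an illustration.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one external command line for a passive service check result."""
    timestamp = int(time.time() if now is None else now)
    # A newline would terminate the command early, so flatten the output.
    clean_output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, clean_output)


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Append one command line to the Nagios external command file."""
    # The command file is a named pipe (FIFO) created by Nagios; this
    # helper (hypothetical) simply writes the formatted line into it.
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

Calling format_passive_result("hostname.domain.com", "PING", 0, "PING OK")
returns a single line that could then be passed to submit() on the central
server.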

When first started, Nagios updates status information (alerts) quickly, but
as time goes on, status updates (alerts) are parsed more and more slowly
until eventually nothing happens and only external commands are written.
This cripples Nagios: since it is not executing local alerts or status
updates, it never executes stale checks or sends out notifications. I'm
left with a web page that displays results anywhere from 6 hours to about
15 minutes old. The odd thing is that the behavior is completely
unpredictable, although it sometimes seems to give priority to names
starting with the first few letters of the alphabet.

If the above confuses you, perhaps a snippet from the log might help:
[1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
**external command lines repeat hundreds of times, with the line below
appearing about 20-30 minutes after the external command**
[1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
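
One way to put numbers on this delay is to pair each EXTERNAL COMMAND entry
with the next SERVICE ALERT for the same host and service. The following is
a rough sketch, not a tested tool: the function name is hypothetical, it
assumes each log entry sits on a single line (unwrapped), and it only sees
results that actually produce a SERVICE ALERT line.

```python
import re

LINE_RE = re.compile(r"^\[(\d+)\] (EXTERNAL COMMAND|SERVICE ALERT): (.*)$")


def processing_lags(lines):
    """Yield (host, service, seconds) from passive result to SERVICE ALERT."""
    pending = {}  # (host, service) -> timestamp of oldest unmatched command
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip any other kind of log entry
        stamp = int(match.group(1))
        kind = match.group(2)
        fields = match.group(3).split(";")
        if kind == "EXTERNAL COMMAND":
            if fields[0] != "PROCESS_SERVICE_CHECK_RESULT" or len(fields) < 3:
                continue
            # Keep the oldest timestamp so the worst-case wait is reported.
            pending.setdefault((fields[1], fields[2]), stamp)
        elif len(fields) >= 2:
            key = (fields[0], fields[1])
            if key in pending:
                yield key[0], key[1], stamp - pending.pop(key)
```

Feeding it the lines of nagios.log, for example
list(processing_lags(open("nagios.log"))) with whatever path the log
actually has, gives one tuple per matched pair, which makes it easier to see
whether the lag grows over time.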

The central server is far from being overworked, with a load average of 0.04
and both CPUs averaging about 96% idle. I can in no way attribute this
behavior to the hardware on my central system.

I've gone through the Nagios configuration file and tried almost every
combination of tweaks including: aggregate updates, aggressive checking,
orphaned services, inter_check_delay_methods, service_interleave_factors,
setting up a ramdisk, etc. I've found the *best* settings seem to be the
"smart" methods but they are FAR from perfect. Nagios is still overrun with
the external commands.

I know there have to be people who have successfully implemented Nagios in a
large distributed environment and I'm hoping some of you might speak up
about issues you may have had.

I believe this problem has to do with Nagios and my guess is it's either a
performance option available in the nagios.cfg or it's something I have to
rewrite/set in the source. I've tried most nagios.cfg options available with
no luck. I've attached my nagios.cfg just in case someone notices a blatant
error (I know not everything here is the most efficient; it's just what my
latest "test" used).

Thanks for your time and sorry for the long explanation!

Jason Lancaster
Intranet Administrator, Affinity Internet
(954) 334-8203

check_external_commands=1
command_check_interval=30s
command_file=/usr/local/nagios/var/rw/nagios.cmd
comment_file=/usr/local/nagios/var/comment.log
downtime_file=/usr/local/nagios/var/downtime.log
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=1
log_external_commands=1
log_passive_service_checks=1
inter_check_delay_method=n
service_interleave_factor=1
max_concurrent_checks=0
service_reaper_frequency=1
sleep_time=1
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=0
use_retained_program_state=0
interval_length=60
use_agressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
check_for_orphaned_services=1
check_service_freshness=1
freshness_check_interval=600
aggregate_status_updates=1
status_update_interval=20
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
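
For anyone comparing settings, here is a small sketch that reads key=value
lines like the ones above and works out how often external commands are
checked. The function names are hypothetical. It relies on the documented
meaning of command_check_interval: a trailing "s" means seconds, a bare
number is multiplied by interval_length, and -1 means "as often as
possible". Whether this interval has anything to do with the backlog
described in this message is not established here.

```python
def parse_main_config(text):
    """Parse key=value lines from a nagios.cfg-style file into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and anything without '='
        key, value = line.split("=", 1)
        # Repeated directives keep only the last value in this sketch.
        settings[key.strip()] = value.strip()
    return settings


def command_check_seconds(settings):
    """Seconds between external command checks, or None when set to -1."""
    raw = settings["command_check_interval"]
    if raw == "-1":
        return None  # -1: check for commands as often as possible
    if raw.endswith("s"):
        return int(raw[:-1])  # explicit seconds, e.g. "30s"
    # A bare number counts time units of interval_length seconds each.
    return int(raw) * int(settings.get("interval_length", "60"))
```

With the values posted above (command_check_interval=30s,
interval_length=60), command_check_seconds() returns 30, meaning the command
file is read once every 30 seconds.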


