distributed monitoring/central server performance problems

Jason Lancaster jlancaster at affinity.com
Mon May 5 21:17:55 CEST 2003


Hi everyone,
I have somewhat of an update on this situation. I've been able to get
similar results in a non-distributed environment. It may or may not give
anyone ideas, but it does simplify the situation.

Current server: Dual P3 1.2 GHz, 1 GB RAM, Red Hat 7.3, Nagios 1.0
Approx 2600 services, 313 hosts.
2324 passive checks sent through nsca and written to external commands file.
359 active service checks.
Uses cfg file posted previously.

Host/Service status does not update regularly. All services are set up to
update within 5 minute periods, but there are results still pending for
checks made over 45 minutes ago. Stale checks are attempted once, but then
Nagios becomes so bogged down that it can't process anything. This is
reflected in the web interface as well as the status.log.
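
For a sense of scale, here is a rough back-of-the-envelope sketch in Python. It assumes every passive check reports once per 5-minute period and that the command file is read every 30 seconds (command_check_interval=30s in the quoted nagios.cfg). The function names are made up for illustration and are not part of Nagios.

```python
def passive_results_per_second(passive_checks: int, period_seconds: int) -> float:
    """Average arrival rate if each check reports once per period (assumption)."""
    return passive_checks / period_seconds


def results_per_command_check(passive_checks: int, period_seconds: int,
                              command_check_interval: int) -> float:
    """Average number of results that pile up between two reads of the command file."""
    return passive_results_per_second(passive_checks, period_seconds) * command_check_interval


# Figures from this message: 2324 passive checks, 5-minute (300 s) periods.
rate = passive_results_per_second(2324, 300)
batch = results_per_command_check(2324, 300, 30)
print(f"{rate:.2f} results/s, roughly {batch:.0f} results per command-file read")
```

These are averages only; real arrivals from NSCA are likely burstier than this.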

If anyone uses Nagios in a similar large-scale environment, I'd really
appreciate some input.

Thanks,
Jason

----- Original Message ----- 
From: "Jason Lancaster" <jason at skynetweb.com>
To: <nagios-users at lists.sourceforge.net>
Sent: Friday, May 02, 2003 19:54
Subject: [Nagios-users] distributed monitoring/central server performance
problems


> A simple background of my environment:
> My central server is receiving external commands from 3 monitoring servers.
> I have just over 3200 services monitored, all delivered to the central
> server through NSCA. Everything works perfectly until the Nagios process on
> the central server attempts to parse the external commands.
>
> When first started, Nagios updates status information (alerts) quickly, but
> as time goes on, status updates (alerts) are parsed slower and slower until
> eventually nothing happens and only external commands are written. This
> cripples Nagios, and since it is not executing local alerts or status
> updates, it never executes stale checks or sends out notifications. I'm
> left with a webpage that displays results anywhere from 6 hours ago to about
> 15 minutes ago. The odd thing about this is that the behavior is completely
> unpredictable, although it sometimes seems to prioritize entries beginning
> with the first few letters of the alphabet.
>
> If the above confuses you, perhaps a snip from the log might help:
> [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> **repeat external command lines hundreds of times, with the following line
> below happening about 20-30 minutes after the external command**
> [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
>
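
To make the layout of the logged external command explicit, here is a minimal Python sketch that splits such a line into its fields. It assumes the entry sits on one physical line in the Nagios log (the mail client wrapped it above) and uses the semicolon-separated PROCESS_SERVICE_CHECK_RESULT form shown in the snippet. The names PassiveResult and parse_passive_result are hypothetical helpers, not part of Nagios.

```python
import re
from typing import NamedTuple, Optional


class PassiveResult(NamedTuple):
    timestamp: int      # epoch seconds from the leading [..] field
    host: str
    service: str
    return_code: int    # 0 is OK in the snippet above
    output: str         # plugin output, kept verbatim


# Matches only logged PROCESS_SERVICE_CHECK_RESULT external commands.
LINE_RE = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;(.*)$"
)


def parse_passive_result(line: str) -> Optional[PassiveResult]:
    """Return the parsed fields, or None if the line is some other log entry."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Split at most 3 times so semicolons inside the plugin output survive.
    parts = match.group(2).split(";", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    host, service, code, output = parts
    return PassiveResult(int(match.group(1)), host, service, int(code), output)


sample = ("[1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;"
          "hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms")
result = parse_passive_result(sample)
```

Counting parsed results per host and service is one way to see how many submissions queue up before the matching SERVICE ALERT appears.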
> The central server is far from being overworked, with a load average of 0.04
> and both CPUs averaging about 96% idle. I can in no way attribute this
> behavior to the hardware on my central system.
>
> I've gone through the Nagios configuration file and tried almost every
> combination of tweaks including: aggregate updates, aggressive checking,
> orphaned services, inter_check_delay_methods, service_interleave_factors,
> setting up a ramdisk, etc. I've found the *best* settings seem to be the
> "smart" methods but they are FAR from perfect. Nagios is still overrun with
> the external commands.
>
> I know there have to be people who have successfully implemented Nagios in a
> large distributed environment and I'm hoping some of you might speak up
> about issues you may have had.
>
> I believe this problem has to do with Nagios, and my guess is it's either a
> performance option available in nagios.cfg or something I have to
> rewrite/set in the source. I've tried most nagios.cfg options available with
> no luck. I've attached my nagios.cfg just in case someone notices a blatant
> error (I know everything here is not the most efficient; it's just what my
> latest "test" used).
>
> Thanks for your time and sorry for the long explanation!
>
> Jason Lancaster
> Intranet Administrator, Affinity Internet
> (954) 334-8203
>
> check_external_commands=1
> command_check_interval=30s
> command_file=/usr/local/nagios/var/rw/nagios.cmd
> comment_file=/usr/local/nagios/var/comment.log
> downtime_file=/usr/local/nagios/var/downtime.log
> lock_file=/usr/local/nagios/var/nagios.lock
> temp_file=/usr/local/nagios/var/nagios.tmp
> log_rotation_method=d
> log_archive_path=/usr/local/nagios/var/archives
> use_syslog=0
> log_notifications=1
> log_service_retries=1
> log_host_retries=1
> log_event_handlers=1
> log_initial_states=1
> log_external_commands=1
> log_passive_service_checks=1
> inter_check_delay_method=n
> service_interleave_factor=1
> max_concurrent_checks=0
> service_reaper_frequency=1
> sleep_time=1
> service_check_timeout=60
> host_check_timeout=30
> event_handler_timeout=30
> notification_timeout=30
> ocsp_timeout=5
> perfdata_timeout=5
> retain_state_information=1
> state_retention_file=/usr/local/nagios/var/status.sav
> retention_update_interval=0
> use_retained_program_state=0
> interval_length=60
> use_agressive_host_checking=0
> execute_service_checks=1
> accept_passive_service_checks=1
> enable_notifications=1
> enable_event_handlers=1
> process_performance_data=0
> obsess_over_services=0
> check_for_orphaned_services=1
> check_service_freshness=1
> freshness_check_interval=600
> aggregate_status_updates=1
> status_update_interval=20
> enable_flap_detection=0
> low_service_flap_threshold=5.0
> high_service_flap_threshold=20.0
> low_host_flap_threshold=5.0
> high_host_flap_threshold=20.0
> date_format=us
>
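
Since the tests described above went through many combinations of nagios.cfg settings, a small Python sketch for comparing two versions of the main config file may help keep track of what changed between runs. It assumes the plain variable=value layout shown above, with "#" starting a comment line. parse_main_cfg and diff_cfg are hypothetical helper names, and a directive that appears more than once keeps only its last value here.

```python
from typing import Dict, Optional, Tuple


def parse_main_cfg(text: str) -> Dict[str, str]:
    """Collect variable=value pairs, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a variable=value line; ignored in this sketch
        settings[key.strip()] = value.strip()  # repeated keys: last one wins
    return settings


def diff_cfg(old_text: str, new_text: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each changed variable to (old value, new value); None means absent."""
    old, new = parse_main_cfg(old_text), parse_main_cfg(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# Example: two test runs that differ only in the delay method.
run_a = "sleep_time=1\ninter_check_delay_method=n\n"
run_b = "# second test\nsleep_time=1\ninter_check_delay_method=s\n"
changes = diff_cfg(run_a, run_b)
```

In practice the two strings would be read from saved copies of nagios.cfg, one per test run.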
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>


