distributed monitoring/central server performance problems

Subhendu Ghosh sghosh at sghosh.org
Mon May 5 23:26:11 CEST 2003


Is there any way for you to run a debugging version of Nagios (DEBUG3) and/or 
run it under strace?
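
Independently of that, the delay between a passive result being logged and the alert it triggers can be estimated from the existing Nagios log. The sketch below is only a rough heuristic, not something Nagios itself provides. It assumes that each log entry sits on a single line in the format shown in the quoted excerpt below, that return codes 0-3 correspond to OK/WARNING/CRITICAL/UNKNOWN, and that a SERVICE ALERT belongs to the oldest still-unmatched passive result with the same state for that host/service. The name `estimate_lags` is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed mapping between plugin return codes and the state names in the log.
STATE_TO_CODE = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# The two entry types shown in the quoted log excerpt, one entry per line.
CMD_RE = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: "
    r"PROCESS_SERVICE_CHECK_RESULT;([^;]+);([^;]+);(\d+);"
)
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([A-Z]+);")


def estimate_lags(lines):
    """Return a list of (host, service, lag_seconds) estimates."""
    pending = defaultdict(list)  # (host, service) -> [(timestamp, return_code)]
    lags = []
    for line in lines:
        cmd = CMD_RE.match(line)
        if cmd:
            ts, host, service, code = cmd.groups()
            pending[(host, service)].append((int(ts), int(code)))
            continue
        alert = ALERT_RE.match(line)
        if not alert:
            continue  # some other kind of log entry
        ts, host, service, state = alert.groups()
        queue = pending[(host, service)]
        wanted = STATE_TO_CODE.get(state)
        for i, (cmd_ts, code) in enumerate(queue):
            if code == wanted:
                lags.append((host, service, int(ts) - cmd_ts))
                # Heuristic: anything logged before the matched result is
                # treated as already handled and dropped.
                del queue[: i + 1]
                break
    return lags
```

Passing an open log file (for example `estimate_lags(open(path))`, with `path` pointing at whatever `log_file` is set to) gives one estimate per alert that could be paired. Alerts produced by active checks on a service that also receives passive results would be mis-paired, so the numbers are indicative only.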

-sg 

On Mon, 5 May 2003, Jason Lancaster wrote:

> Hi everyone,
> I have somewhat of an update on this situation. I've been able to get
> similar results in a non-distributed environment. It may or may not give
> anyone ideas, but it does simplify the situation.
> 
> Current server: Dual P3 1.2 GHz, 1 GB RAM, Red Hat 7.3, Nagios 1.0
> Approx 2600 services, 313 hosts.
> 2324 passive checks sent through NSCA and written to external commands file.
> 359 active service checks.
> Uses cfg file posted previously.
> 
> Host/Service status does not update regularly. All services are set up to
> update within 5-minute periods, but there are results still pending for
> checks made over 45 minutes ago. Stale checks are attempted once, but then
> Nagios becomes so bogged down that it can't process anything. This is
> reflected in the web interface as well as the status.log.
> 
> If anyone uses Nagios in a similar large-scale environment, I'd really
> appreciate some input.
> 
> Thanks,
> Jason
> 
> ----- Original Message ----- 
> From: "Jason Lancaster" <jason at skynetweb.com>
> To: <nagios-users at lists.sourceforge.net>
> Sent: Friday, May 02, 2003 19:54
> Subject: [Nagios-users] distributed monitoring/central server performance
> problems
> 
> 
> > A simple background of my environment:
> > My central server is receiving external commands from 3 monitoring
> > servers. I have just over 3200 services monitored, all delivered to the
> > central server through NSCA. Everything works perfectly until the Nagios
> > process on the central server attempts to parse the external commands.
> >
> > When first started, Nagios updates status information (alerts) quickly, but
> > as time goes on, status updates (alerts) are parsed more and more slowly
> > until eventually nothing happens and only external commands are written.
> > This cripples Nagios: since it is not executing local alerts or status
> > updates, it never executes stale checks or sends out notifications. I'm
> > left with a web page that displays results anywhere from 6 hours ago to
> > about 15 minutes ago. The odd thing is that the behavior is completely
> > unpredictable, although it sometimes seems to give alphabetical priority
> > to the first few letters of the alphabet.
> >
> > If the above confuses you, perhaps a snip from the log might help:
> > [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> > **repeat external command lines hundreds of times, with the following line
> > below happening about 20-30 minutes after the external command**
> > [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
> >
> > The central server is far from being overworked, with a load average of
> > 0.04 and both CPUs averaging about 96% idle. I can in no way attribute
> > this behavior to the hardware on my central system.
> >
> > I've gone through the Nagios configuration file and tried almost every
> > combination of tweaks, including: aggregate updates, aggressive checking,
> > orphaned services, inter_check_delay_method, service_interleave_factor,
> > setting up a ramdisk, etc. I've found the *best* settings seem to be the
> > "smart" methods, but they are FAR from perfect. Nagios is still overrun
> > with the external commands.
> >
> > I know there have to be people who have successfully implemented Nagios in
> > a large distributed environment, and I'm hoping some of you might speak up
> > about issues you may have had.
> >
> > I believe this problem has to do with Nagios, and my guess is it's either a
> > performance option available in nagios.cfg or something I have to
> > rewrite/set in the source. I've tried most of the available nagios.cfg
> > options with no luck. I've attached my nagios.cfg in case someone notices
> > a blatant error (I know not everything here is the most efficient; it's
> > just what my latest "test" used).
> >
> > Thanks for your time and sorry for the long explanation!
> >
> > Jason Lancaster
> > Intranet Administrator, Affinity Internet
> > (954) 334-8203
> >
> > check_external_commands=1
> > command_check_interval=30s
> > command_file=/usr/local/nagios/var/rw/nagios.cmd
> > comment_file=/usr/local/nagios/var/comment.log
> > downtime_file=/usr/local/nagios/var/downtime.log
> > lock_file=/usr/local/nagios/var/nagios.lock
> > temp_file=/usr/local/nagios/var/nagios.tmp
> > log_rotation_method=d
> > log_archive_path=/usr/local/nagios/var/archives
> > use_syslog=0
> > log_notifications=1
> > log_service_retries=1
> > log_host_retries=1
> > log_event_handlers=1
> > log_initial_states=1
> > log_external_commands=1
> > log_passive_service_checks=1
> > inter_check_delay_method=n
> > service_interleave_factor=1
> > max_concurrent_checks=0
> > service_reaper_frequency=1
> > sleep_time=1
> > service_check_timeout=60
> > host_check_timeout=30
> > event_handler_timeout=30
> > notification_timeout=30
> > ocsp_timeout=5
> > perfdata_timeout=5
> > retain_state_information=1
> > state_retention_file=/usr/local/nagios/var/status.sav
> > retention_update_interval=0
> > use_retained_program_state=0
> > interval_length=60
> > use_agressive_host_checking=0
> > execute_service_checks=1
> > accept_passive_service_checks=1
> > enable_notifications=1
> > enable_event_handlers=1
> > process_performance_data=0
> > obsess_over_services=0
> > check_for_orphaned_services=1
> > check_service_freshness=1
> > freshness_check_interval=600
> > aggregate_status_updates=1
> > status_update_interval=20
> > enable_flap_detection=0
> > low_service_flap_threshold=5.0
> > high_service_flap_threshold=20.0
> > low_host_flap_threshold=5.0
> > high_host_flap_threshold=20.0
> > date_format=us
> >
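
For a sense of scale, the figures quoted above can be turned into an arrival rate. This is only back-of-the-envelope arithmetic, assuming every passive service reports once per 5-minute period and that results arrive evenly; the thread does not establish how many queued commands Nagios 1.0 actually works through per check. The helper name `passive_load` is hypothetical.

```python
def passive_load(services, report_period_s, command_check_interval_s):
    """Return (results per second, results accumulated between command checks).

    Assumes each service reports exactly once per report period and that
    results are spread evenly over time.
    """
    per_second = services / report_period_s
    per_check = per_second * command_check_interval_s
    return per_second, per_check


# 2324 passive checks (quoted above), 5-minute reporting period,
# command_check_interval=30s from the quoted nagios.cfg.
per_second, per_check = passive_load(2324, 5 * 60, 30)
print(f"{per_second:.1f} passive results/s, about {per_check:.0f} per command check")
# prints "7.7 passive results/s, about 232 per command check"
```

With the 3200 services mentioned for the distributed setup, the same arithmetic gives roughly 10.7 results per second, or about 320 between command checks.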
