distributed monitoring/central server performance problems

Ethan Galstad nagios at nagios.org
Wed May 7 01:24:29 CEST 2003
NSCA may be to blame (consolidated transmits would help), but it is 
more likely that you are experiencing a bottleneck with the external 
command file.  This file is implemented as a named pipe, which (under 
Linux) has a size of 4K.  If one external command for a passive check 
is ~100 bytes, that means you can fit about 40 passive checks into the 
pipe before it fills up.  Your config snippet indicates that you are 
checking for external commands every 30 seconds.  That's way too long 
of an interval - Nagios will only process ~1.5 passive checks per 
second at that rate (you've got ~10 per second incoming).  Try 
setting the command check interval to 3 seconds (or -1) and see if 
that helps.
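
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Nagios; the function names are made up for illustration, and the figures (4K pipe, ~100 bytes per command, ~10 incoming checks per second) are the estimates from this message, not measurements. With these numbers a 30 second interval works out to roughly 1.3 checks per second, in line with the estimate above.

```python
# Back-of-the-envelope model of the external command pipe bottleneck.
# All figures are the rough estimates from the message, not measurements.

PIPE_CAPACITY_BYTES = 4096   # named pipe size under Linux (4K)
COMMAND_SIZE_BYTES = 100     # approximate size of one passive check command


def checks_per_drain(pipe_bytes=PIPE_CAPACITY_BYTES,
                     command_bytes=COMMAND_SIZE_BYTES):
    """Whole commands that fit in the pipe before it fills up."""
    return pipe_bytes // command_bytes


def max_checks_per_second(interval_seconds):
    """Upper bound on throughput if the pipe is drained once per interval."""
    return checks_per_drain() / interval_seconds


if __name__ == "__main__":
    incoming = 10  # approximate incoming passive checks per second
    for interval in (30, 3):
        rate = max_checks_per_second(interval)
        verdict = "keeps up" if rate >= incoming else "falls behind"
        print(f"interval={interval}s: ~{rate:.1f} checks/s, {verdict}")
```

Under this simple model a 3 second interval gives an upper bound just above the incoming rate, which is why it is worth trying before anything more drastic.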

Nagios 2.0 should be able to handle this much better than 1.0, as 
I've written in a dedicated thread that continuously reads from the 
command file and buffers the input for later handling.  By default, 
this should allow you to handle ~512+ passive checks per command 
check interval.
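
As an illustration of the idea only (this is not the Nagios 2.0 source), a dedicated reader thread feeding a bounded buffer might look roughly like the Python sketch below. The names `reader`, `drain`, `demo` and `BUFFER_SLOTS` are hypothetical, the 512 figure is simply taken from the estimate above, and an in-memory `io.StringIO` stands in for the named pipe so the sketch runs anywhere.

```python
import io
import queue
import threading

BUFFER_SLOTS = 512  # assumed buffer size, taken from the "~512+" figure above


def reader(stream, buffer):
    """Continuously read commands from the stream and buffer them.

    put() blocks when the buffer is full, so input is only held up
    once every slot is in use.
    """
    for line in stream:
        line = line.strip()
        if line:
            buffer.put(line)


def drain(buffer):
    """Run once per command check interval: take everything buffered so far."""
    commands = []
    while True:
        try:
            commands.append(buffer.get_nowait())
        except queue.Empty:
            return commands


def demo():
    # io.StringIO stands in for the named pipe; the timestamp 0 is a placeholder.
    fake_pipe = io.StringIO("".join(
        f"[0] PROCESS_SERVICE_CHECK_RESULT;host{i};PING;0;PING OK\n"
        for i in range(100)))
    buffer = queue.Queue(maxsize=BUFFER_SLOTS)
    worker = threading.Thread(target=reader, args=(fake_pipe, buffer), daemon=True)
    worker.start()
    worker.join()  # the fake pipe reaches EOF; a real reader would keep running
    return drain(buffer)
```

The point of the design is that the pipe gets emptied as soon as data arrives, so its 4K limit stops mattering and the limit becomes the number of buffer slots per command check interval.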


On 6 May 2003 at 1:13, Jason Lancaster wrote:

> Sg,
> Thanks for the input on this... I believe what I've run into is a
> bottleneck with send_nsca and the general ocsp command. 3200 services
> all using separate send_nsca commands on an average check_interval of
> 5 minutes makes for a very crazy system; almost 10 incoming/outgoing
> nsca connections a second.
> 
> I've tested this theory by disabling the ocsp command and using
> simpler commands (such as an echo to a log file) with success.
> 
> I'm going to write an nsca "sweeper" tomorrow and see how batches of
> the ocsp commands (50 at a time, 100 at a time, etc.) sent through a
> single nsca connection to the central server are processed.
> 
> I'll keep posting and be sure to let everyone know if I solve it.
> 
> -Jason
> 
> -----Original Message-----
> From: nagios-users-admin at lists.sourceforge.net
> [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of
> Subhendu Ghosh
> Sent: Monday, May 05, 2003 17:26
> To: nagios-users at lists.sourceforge.net
> Subject: Re: [Nagios-users] distributed monitoring/central server
> performance problems
> 
> Is there any way for you to run a debugging version of Nagios (DEBUG3)
> and/or strace?
> 
> -sg 
> 
> On Mon, 5 May 2003, Jason Lancaster wrote:
> 
> > Hi everyone,
> > I have somewhat of an update on this situation. I've been able to
> > get similar results in a non-distributed environment. It may or may
> > not give anyone ideas but it does simplify the situation.
> > 
> > Current server: Dual p3 1.2 ghz, 1gb ram, redhat 7.3, nagios 1.0
> > Approx 2600 services, 313 hosts. 2324 passive checks sent through
> > nsca and written to external commands file.
> > 359 active service checks.
> > Uses cfg file posted previously.
> > 
> > Host/Service status does not update regularly. All services are
> > set up to update within 5 minute periods, but there are results
> > still pending for checks made over 45 minutes ago. Stale checks are
> > attempted once but then Nagios becomes so bogged down that it can't
> > process anything. This is reflected in the web interface as well as
> > the status.log.
> > 
> > If anyone uses Nagios in a similar large-scale environment, I'd
> > really appreciate some input.
> > 
> > Thanks,
> > Jason
> > 
> > ----- Original Message ----- 
> > From: "Jason Lancaster" <jason at skynetweb.com>
> > To: <nagios-users at lists.sourceforge.net>
> > Sent: Friday, May 02, 2003 19:54
> > Subject: [Nagios-users] distributed monitoring/central server
> > performance problems
> > 
> > 
> > > A simple background of my environment:
> > > My central server is receiving external commands from 3 monitoring
> > > servers. I have just over 3200 services monitored, all delivered to
> > > the central server through NSCA. Everything works perfectly until
> > > the Nagios process on the central server attempts to parse the
> > > external commands.
> > >
> > > When first started, Nagios updates status information (alerts)
> > > quickly but as time goes on, status updates (alerts) are parsed
> > > slower and slower until eventually, nothing happens and only
> > > external commands are written. This cripples Nagios and since it is
> > > not executing local alerts or status updates, it never executes
> > > stale checks or sends out notifications. I'm left with a webpage
> > > that displays results anywhere from 6 hours ago to about 15 minutes
> > > ago. The odd thing about this is the behavior is completely
> > > unpredictable, although it sometimes seems like it gives an
> > > alphabetical priority to the first few letters in the alphabet.
> > >
> > > If the above confuses you, perhaps a snip from the log might help:
> > > [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> > > **repeat external command lines hundreds of times, with the
> > > following line below happening about 20-30 minutes after the
> > > external command**
> > > [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
> > >
> > > The central server is far from being overworked with a load
> > > average of 0.04 and both CPUs average about 96% idle. I can in no
> > > way attribute this behavior to the hardware on my central system.
> > >
> > > I've gone through the nagios configuration file and tried almost
> > > every combination of tweaks including: aggregate updates,
> > > aggressive checking, orphaned services, inter_check_delay_methods,
> > > service_interleave_factors, setting up a ramdisk, etc. I've found
> > > the *best* settings seem to be the "smart" methods but they are FAR
> > > from perfect. Nagios still is overrun with the external commands.
> > >
> > > I know there have to be people who have successfully implemented
> > > Nagios in a large distributed environment and I'm hoping some of
> > > you might speak up about issues you may have had.
> > >
> > > I believe this problem has to do with Nagios and my guess is it's
> > > either a performance option available in the nagios.cfg or it's
> > > something I have to rewrite/set in the source. I've tried most
> > > nagios.cfg options available with no luck. I've attached my
> > > nagios.cfg just in case someone notices a blatant error (I know
> > > everything here is not the most efficient, it's just what my latest
> > > "test" used).
> > >
> > > Thanks for your time and sorry for the long explanation!
> > >
> > > Jason Lancaster
> > > Intranet Administrator, Affinity Internet
> > > (954) 334-8203
> > >
> > > check_external_commands=1
> > > command_check_interval=30s
> > > command_file=/usr/local/nagios/var/rw/nagios.cmd
> > > comment_file=/usr/local/nagios/var/comment.log
> > > downtime_file=/usr/local/nagios/var/downtime.log
> > > lock_file=/usr/local/nagios/var/nagios.lock
> > > temp_file=/usr/local/nagios/var/nagios.tmp
> > > log_rotation_method=d
> > > log_archive_path=/usr/local/nagios/var/archives
> > > use_syslog=0
> > > log_notifications=1
> > > log_service_retries=1
> > > log_host_retries=1
> > > log_event_handlers=1
> > > log_initial_states=1
> > > log_external_commands=1
> > > log_passive_service_checks=1
> > > inter_check_delay_method=n
> > > service_interleave_factor=1
> > > max_concurrent_checks=0
> > > service_reaper_frequency=1
> > > sleep_time=1
> > > service_check_timeout=60
> > > host_check_timeout=30
> > > event_handler_timeout=30
> > > notification_timeout=30
> > > ocsp_timeout=5
> > > perfdata_timeout=5
> > > retain_state_information=1
> > > state_retention_file=/usr/local/nagios/var/status.sav
> > > retention_update_interval=0
> > > use_retained_program_state=0
> > > interval_length=60
> > > use_agressive_host_checking=0
> > > execute_service_checks=1
> > > accept_passive_service_checks=1
> > > enable_notifications=1
> > > enable_event_handlers=1
> > > process_performance_data=0
> > > obsess_over_services=0
> > > check_for_orphaned_services=1
> > > check_service_freshness=1
> > > freshness_check_interval=600
> > > aggregate_status_updates=1
> > > status_update_interval=20
> > > enable_flap_detection=0
> > > low_service_flap_threshold=5.0
> > > high_service_flap_threshold=20.0
> > > low_host_flap_threshold=5.0
> > > high_host_flap_threshold=20.0
> > > date_format=us
> 
> 
> 
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue. ::: Messages without supporting info will risk
> being sent to /dev/null
> 



Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org



