distributed monitoring/central server performance problems

Jason Lancaster jlancaster at affinity.com
Tue May 6 07:13:37 CEST 2003


Sg,
Thanks for the input on this... I believe what I've run into is a
bottleneck with send_nsca and the general ocsp command. 3200 services
all using separate send_nsca commands on an average check_interval of 5
minutes makes for a very crazy system; roughly 10 incoming/outgoing nsca
connections a second.
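
The arithmetic behind that figure can be checked with a short Python sketch. The function name is illustrative only; the inputs are the numbers quoted above, and it assumes every service result goes out through its own send_nsca invocation.

```python
def nsca_connections_per_second(services: int, check_interval_minutes: float) -> float:
    """Average number of send_nsca connections per second when each
    service result is submitted through a separate send_nsca call."""
    return services / (check_interval_minutes * 60.0)


# 3200 services, each reporting once per 5-minute check interval.
rate = nsca_connections_per_second(3200, 5)
print(f"{rate:.1f} connections/second")  # prints "10.7 connections/second"
```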

I've tested this theory by disabling the ocsp command and using simpler
commands (such as an echo to a log file) with success.

I'm going to write an nsca "sweeper" tomorrow and see how sending
batches of the ocsp results (50 at a time, 100 at a time, etc)
through a single nsca connection to the central server are processed.
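
A minimal sketch of what such a sweeper might look like, in Python. This is not the script from this thread; it only illustrates the batching idea. It relies on send_nsca reading one tab-delimited result per line (host, service description, return code, plugin output) from standard input, so many results can share one connection. The function names, the send_nsca binary path and the send_nsca.cfg path are assumptions for illustration.

```python
import subprocess
from typing import Iterable, Iterator, List, Tuple

# (host name, service description, return code, plugin output)
Result = Tuple[str, str, int, str]


def format_result(result: Result) -> str:
    """Render one result as a tab-delimited send_nsca input line."""
    host, service, code, output = result
    # Tabs and newlines in plugin output would corrupt the line format.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{code}\t{clean}\n"


def batches(results: Iterable[Result], size: int) -> Iterator[List[Result]]:
    """Yield the results in lists of at most `size` items."""
    batch: List[Result] = []
    for result in results:
        batch.append(result)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_batch(
    batch: List[Result],
    central_host: str,
    send_nsca: str = "/usr/local/nagios/bin/send_nsca",   # assumed path
    config: str = "/usr/local/nagios/etc/send_nsca.cfg",  # assumed path
) -> None:
    """Send a whole batch through a single send_nsca invocation."""
    payload = "".join(format_result(r) for r in batch)
    subprocess.run(
        [send_nsca, "-H", central_host, "-c", config],
        input=payload,
        text=True,
        check=True,
    )
```

With the ocsp command changed to append results to a spool instead of calling send_nsca directly, the sweeper would periodically read the spool and call send_batch() once per batch from batches(pending, 50). At a batch size of 50, the 3200 services would need 64 connections per 5-minute cycle rather than 3200.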

I'll keep posting and be sure to let everyone know if I solve it.

-Jason

-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net
[mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Subhendu Ghosh
Sent: Monday, May 05, 2003 17:26
To: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] distributed monitoring/central server performance problems

Is there any way for you to run a debugging version of Nagios (DEBUG3)
and/or strace?

-sg 

On Mon, 5 May 2003, Jason Lancaster wrote:

> Hi everyone,
> I have somewhat of an update on this situation. I've been able to get
> similar results in a non-distributed environment. It may or may not give
> anyone ideas but it does simplify the situation.
> 
> Current server: Dual p3 1.2 ghz, 1gb ram, redhat 7.3, nagios 1.0
> Approx 2600 services, 313 hosts.
> 2324 passive checks sent through nsca and written to external commands file.
> 359 active service checks.
> Uses cfg file posted previously.
> 
> Host/Service status does not update regularly. All services are set up to
> update within 5 minute periods, but there are results still pending for
> checks made over 45 minutes ago. Stale checks are attempted once but then
> Nagios becomes so bogged down that it can't process anything. This is
> reflected in the web interface as well as the status.log.
> 
> If anyone uses Nagios in a similar large-scale environment, I'd really
> appreciate some input.
> 
> Thanks,
> Jason
> 
> ----- Original Message ----- 
> From: "Jason Lancaster" <jason at skynetweb.com>
> To: <nagios-users at lists.sourceforge.net>
> Sent: Friday, May 02, 2003 19:54
> Subject: [Nagios-users] distributed monitoring/central server performance
> problems
> 
> 
> > A simple background of my environment:
> > My central server is receiving external commands from 3 monitoring
> > servers. I have just over 3200 services monitored, all delivered to the
> > central server through NSCA. Everything works perfectly until the Nagios
> > process on the central server attempts to parse the external commands.
> >
> > When first started, Nagios updates status information (alerts) quickly but
> > as time goes on, status updates (alerts) are parsed slower and slower until
> > eventually, nothing happens and only external commands are written. This
> > cripples nagios and since it is not executing local alerts or status
> > updates, it never executes stale_check's or sends out notifications. I'm
> > left with a webpage that displays results anywhere from 6 hours ago to about
> > 15 minutes ago. The odd thing about this is the behavior is completely
> > unpredictable, although it sometimes seems like it gives an alphabetical
> > priority to the first few letters in the alphabet.
> >
> > If the above confuses you, perhaps a snip from the log might help:
> > [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> > **repeat external command lines hundreds of times, with the following line
> > below happening about 20-30 minutes after the external command**
> > [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
> >
> > The central server is far from being overworked with a load average of 0.04
> > and both CPUs average about 96% idle. I can in no way attribute this
> > behavior to the hardware on my central system.
> >
> > I've gone through the nagios configuration file and tried almost every
> > combination of tweaks including: aggregate updates, aggressive checking,
> > orphaned services, inter_check_delay_methods, service_interleave_factors,
> > setting up a ramdisk, etc. I've found the *best* settings seem to be the
> > "smart" methods but they are FAR from perfect. Nagios still is overrun with
> > the external commands.
> >
> > I know there have to be people who have successfully implemented Nagios in a
> > large distributed environment and I'm hoping some of you might speak up
> > about issues you may have had.
> >
> > I believe this problem has to do with Nagios and my guess is it's either a
> > performance option available in the nagios.cfg or it's something I have to
> > rewrite/set in the source. I've tried most nagios.cfg options available with
> > no luck. I've attached my nagios.cfg just in case someone notices a blatant
> > error (I know everything here is not the most efficient, it's just what my
> > latest "test" used)
> >
> > Thanks for your time and sorry for the long explanation!
> >
> > Jason Lancaster
> > Intranet Administrator, Affinity Internet
> > (954) 334-8203
> >
> > check_external_commands=1
> > command_check_interval=30s
> > command_file=/usr/local/nagios/var/rw/nagios.cmd
> > comment_file=/usr/local/nagios/var/comment.log
> > downtime_file=/usr/local/nagios/var/downtime.log
> > lock_file=/usr/local/nagios/var/nagios.lock
> > temp_file=/usr/local/nagios/var/nagios.tmp
> > log_rotation_method=d
> > log_archive_path=/usr/local/nagios/var/archives
> > use_syslog=0
> > log_notifications=1
> > log_service_retries=1
> > log_host_retries=1
> > log_event_handlers=1
> > log_initial_states=1
> > log_external_commands=1
> > log_passive_service_checks=1
> > inter_check_delay_method=n
> > service_interleave_factor=1
> > max_concurrent_checks=0
> > service_reaper_frequency=1
> > sleep_time=1
> > service_check_timeout=60
> > host_check_timeout=30
> > event_handler_timeout=30
> > notification_timeout=30
> > ocsp_timeout=5
> > perfdata_timeout=5
> > retain_state_information=1
> > state_retention_file=/usr/local/nagios/var/status.sav
> > retention_update_interval=0
> > use_retained_program_state=0
> > interval_length=60
> > use_agressive_host_checking=0
> > execute_service_checks=1
> > accept_passive_service_checks=1
> > enable_notifications=1
> > enable_event_handlers=1
> > process_performance_data=0
> > obsess_over_services=0
> > check_for_orphaned_services=1
> > check_service_freshness=1
> > freshness_check_interval=600
> > aggregate_status_updates=1
> > status_update_interval=20
> > enable_flap_detection=0
> > low_service_flap_threshold=5.0
> > high_service_flap_threshold=20.0
> > low_host_flap_threshold=5.0
> > high_host_flap_threshold=20.0
> > date_format=us


