[Nagios-users] distributed monitoring/central server performance problems

Jason Lancaster jlancaster at affinity.com
Fri May 9 05:20:07 CEST 2003
Most of what I said below presumed that Nagios had "smart" ocsp
functionality built in. The more I think about it, "smart" ocsp
functionality can only be accomplished with a separate daemon or
process, exactly as I have done with the perl script that reads the ocsp
log at given intervals and then forks the results out to multiple
send_nsca processes. Otherwise, the general check process gets drawn out
exactly as I experienced. This might be a really good feature for
Nagios 2.0, which is why I cc'd this to nagios-devel.

As a follow-up to my last email, I tried sending each individual service
result through send_nsca (as the default ocsp command shell script on
nagios.org does) from within the consolidated send_nsca daemon, and
everything worked quite well. I'm very satisfied with my results. The
most efficient settings for my implementation seem to be parsing the
ocsp.log every 3 seconds and then forking every 4000 bytes into a
send_nsca process. If you have any questions about what I did, please
let me know.
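
For anyone who wants to try something similar, the batching step can be
sketched roughly as follows. This is a minimal Python illustration, not
the perl script itself: the function name batch_by_bytes is a
placeholder, the 4000-byte default is simply the number mentioned above,
and the actual hand-off to send_nsca is left out.

```python
from typing import Iterable, List


def batch_by_bytes(lines: Iterable[str], max_bytes: int = 4000) -> List[List[str]]:
    """Group result lines into batches of at most max_bytes each.

    Hypothetical helper: each batch would be piped to one send_nsca
    process. A single line longer than max_bytes gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for line in lines:
        # Count the trailing newline that would be written along with the line.
        size = len(line.encode("utf-8")) + 1
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(line)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # 100 fake results of 99 bytes each (100 bytes with the newline).
    fake = ["x" * 99 for _ in range(100)]
    for i, batch in enumerate(batch_by_bytes(fake)):
        print(i, len(batch))
```

Splitting on whole lines keeps each result intact, so no send_nsca
process ever receives half of a check result.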

-Jason

-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net
[mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Jason
Lancaster
Sent: Thursday, May 08, 2003 14:33
To: Ethan Galstad; nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] distributed monitoring/central server
performance problems

Ethan and list,
I agree the check command interval I was using may not have been the
most efficient, and I likely would have eventually seen a problem on the
central server parsing these external commands. After making my first
post, I realized my issue is with how Nagios manages an outgoing ocsp
command.
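
To put rough numbers on the command-file side of this, here is the
arithmetic from Ethan's reply (quoted below) as a small Python sketch.
The figures are the estimates from this thread (a 4K pipe, ~100 bytes
per passive check command, 3200 services on a 5 minute interval), and it
assumes the pipe is drained only once per command check interval. The
function names are made up for illustration.

```python
def max_passive_rate(pipe_bytes: int, cmd_bytes: int, interval_s: float) -> float:
    """Upper bound on passive checks processed per second when the
    command pipe is drained only once per command check interval."""
    commands_per_drain = pipe_bytes // cmd_bytes
    return commands_per_drain / interval_s


def incoming_rate(services: int, check_interval_s: float) -> float:
    """Average passive check results arriving per second."""
    return services / check_interval_s


if __name__ == "__main__":
    incoming = incoming_rate(3200, 5 * 60)
    for interval in (30, 3):
        rate = max_passive_rate(4096, 100, interval)
        print(f"interval={interval}s: drains {rate:.2f}/s, incoming {incoming:.2f}/s")
```

With these figures a 30 second interval drains roughly 1.3 results per
second against roughly 10.7 arriving, while a 3 second interval drains
roughly 13.3 per second.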

I came to this conclusion after observing the following:
In my "non-distributed" test environment, I have 2683 service checks.
I'm using OCSP with ocsp_timeout=3. This OCSP command does not go to any
other systems; it is just a simple echo, "echo $1 $2 $3 $4 >> ocsp.log".
Looking at the webpage and the status.log, things are updating within a
10-15 minute interval. This is the behavior I expect, and it works quite
well.

Replace this echo with a command that takes slightly longer to execute,
by adding a "sleep 3" into the mix, and I start having problems. Service
and host update intervals go from approximately 10 minutes to 15, to 20,
to 30... to never getting updated. The system stops executing active
checks of any type, including freshness checks. Nagios becomes useless
at this point.
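
A back-of-the-envelope estimate makes the slowdown plausible. The Python
sketch below assumes that Nagios runs ocsp commands one at a time and
waits for each to finish; that is an assumption for illustration and is
not confirmed anywhere in this thread. The function name is made up.

```python
def serial_ocsp_minutes(services: int, seconds_per_command: float) -> float:
    """Minutes needed to run one ocsp command per service, back to back.

    Assumes the commands are executed serially and that nothing else
    overlaps with them; this is an illustration, not a measurement.
    """
    return services * seconds_per_command / 60.0


if __name__ == "__main__":
    print(f"{serial_ocsp_minutes(2683, 3):.0f} minutes per full pass")
```

Under that assumption, 2683 services at 3 seconds each come to about
134 minutes per pass, far longer than the 10-15 minute cycle seen with
the plain echo.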

If I comment my sleep line out at this point, Nagios begins to sync back
to its normal 10 minute intervals.

I don't know where the problem lies; it very well could be the way I
have Nagios configured. Personally, I theorize this is due to how Nagios
decides to manage its ocsp commands. Perhaps if one ocsp command takes a
long time to execute, Nagios thinks that everything needs more execution
time. I don't know much about C and I don't know the source well, but
I'm more than willing to work with anyone who wants more information on
this issue.

I've pretty much given up on handling any advanced ocsp methods within
Nagios and made Nagios execute the ocsp command as quickly as possible
using a simple bash echo script and a fifo. I have to keep in mind the
important factors Ethan discussed in his last reply, so I'm sending
consolidated nsca results at 10 second intervals. This could be lowered
to make the consolidated NSCA parser send each service result through
NSCA individually (just like the default ocsp behavior of Nagios in a
distributed environment). This may in fact work. I have yet to test it,
but I think I will soon.
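
For reference, the consolidated sender only has to turn each logged
result into the tab-delimited line that send_nsca reads on standard
input (host, service description, return code, plugin output). Here is a
rough Python sketch of that formatting step. The function name and the
sample values are hypothetical, and the fifo reading and the send_nsca
call itself are left out.

```python
def format_nsca_line(host: str, service: str, return_code: int, output: str) -> str:
    """Build one tab-delimited service check line for send_nsca's stdin."""
    # Tabs and newlines inside the plugin output would break the
    # delimiting, so flatten them to spaces first.
    clean = output.replace("\t", " ").replace("\n", " ").strip()
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


if __name__ == "__main__":
    line = format_nsca_line(
        "hostname.domain.com", "PING", 0,
        "PING OK - Packet loss = 0%, RTA = 0.80 ms",
    )
    print(repr(line))
```

Several such lines can be concatenated and written to a single
send_nsca process, which is what makes the consolidated send cheaper
than one connection per result.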

Thanks,
Jason

----- Original Message ----- 
From: "Ethan Galstad" <nagios at nagios.org>
To: <nagios-users at lists.sourceforge.net>
Sent: Tuesday, May 06, 2003 19:24
Subject: RE: [Nagios-users] distributed monitoring/central server
performance problems


> NSCA may be to blame (consolidated transmits would help), but it is
> more likely that you are experiencing a bottleneck with the external
> command file.  This file is implemented as a named pipe, which (under
> Linux) has a size of 4K.  If one external command for a passive check
> is ~100 bytes, that means you can fit about 40 passive checks into the
> pipe before it fills up.  Your config snippet indicates that you are
> checking for external commands every 30 seconds.  That's way too long
> of an interval - Nagios will only process ~1.5 passive checks per
> second at that rate (you've got ~10 per second incoming).  Try
> setting the command check interval to 3 seconds (or -1) and see if
> that helps.
>
> Nagios 2.0 should be able to handle this much better than 1.0, as
> I've written in a dedicated thread that continuously reads from the
> command file and buffers the input for later handling.  By default,
> this should allow you to handle ~512+ passive checks per command
> check interval.
>
>
> On 6 May 2003 at 1:13, Jason Lancaster wrote:
>
> > Sg,
> > Thanks for the input on this... I believe what I've run into is a
> > bottleneck with send_nsca and the general ocsp command. 3200
> > services all using separate send_nsca commands on an average
> > check_interval of 5 minutes makes for a very crazy system; almost 10
> > incoming/outgoing nsca connections a second.
> >
> > I've tested this theory by disabling the ocsp command and using
> > simpler commands (such as an echo to a log file) with success.
> >
> > I'm going to write an nsca "sweeper" tomorrow and see how the
> > central server handles ocsp results sent in batches (50 at a time,
> > 100 at a time, etc.) through a single nsca connection.
> >
> > I'll keep posting and be sure to let everyone know if I solve it.
> >
> > -Jason
> >
> > -----Original Message-----
> > From: nagios-users-admin at lists.sourceforge.net
> > [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of
> > Subhendu Ghosh
> > Sent: Monday, May 05, 2003 17:26
> > To: nagios-users at lists.sourceforge.net
> > Subject: Re: [Nagios-users] distributed monitoring/central server
> > performance problems
> >
> > Is there any way for you to run a debugging version of Nagios
> > (DEBUG3) and/or strace?
> >
> > -sg
> >
> > On Mon, 5 May 2003, Jason Lancaster wrote:
> >
> > > Hi everyone,
> > > I have somewhat of an update on this situation. I've been able to
> > > get similar results in a non-distributed environment. It may or
> > > may not give anyone ideas, but it does simplify the situation.
> > >
> > > Current server: Dual P3 1.2 GHz, 1 GB RAM, Red Hat 7.3, Nagios 1.0
> > > Approx 2600 services, 313 hosts.
> > > 2324 passive checks sent through nsca and written to the external
> > > commands file.
> > > 359 active service checks.
> > > Uses cfg file posted previously.
> > >
> > > Host/Service status does not update regularly. All services are
> > > set up to update within 5 minute periods, but there are results
> > > still pending for checks made over 45 minutes ago. Stale checks are
> > > attempted once, but then Nagios becomes so bogged down that it
> > > can't process anything. This is reflected in the web interface as
> > > well as the status.log.
> > >
> > > If anyone uses Nagios in a similar large-scale environment, I'd
> > > really appreciate some input.
> > >
> > > Thanks,
> > > Jason
> > >
> > > ----- Original Message ----- 
> > > From: "Jason Lancaster" <jason at skynetweb.com>
> > > To: <nagios-users at lists.sourceforge.net>
> > > Sent: Friday, May 02, 2003 19:54
> > > Subject: [Nagios-users] distributed monitoring/central server
> > > performance problems
> > >
> > >
> > > > A simple background of my environment:
> > > > My central server is receiving external commands from 3
> > > > monitoring servers. I have just over 3200 services monitored, all
> > > > delivered to the central server through NSCA. Everything works
> > > > perfectly until the Nagios process on the central server attempts
> > > > to parse the external commands.
> > > >
> > > > When first started, Nagios updates status information (alerts)
> > > > quickly, but as time goes on, status updates (alerts) are parsed
> > > > slower and slower until eventually nothing happens and only
> > > > external commands are written. This cripples Nagios, and since it
> > > > is not executing local alerts or status updates, it never
> > > > executes stale checks or sends out notifications. I'm left with a
> > > > webpage that displays results anywhere from 6 hours ago to about
> > > > 15 minutes ago. The odd thing is that the behavior is completely
> > > > unpredictable, although it sometimes seems to give priority to
> > > > names beginning with the first few letters of the alphabet.
> > > >
> > > > If the above confuses you, perhaps a snip from the log might help:
> > > > [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> > > > **repeat external command lines hundreds of times, with the
> > > > line below happening about 20-30 minutes after the external
> > > > command**
> > > > [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
> > > >
> > > > The central server is far from being overworked, with a load
> > > > average of 0.04 and both CPUs averaging about 96% idle. I can in
> > > > no way attribute this behavior to the hardware on my central
> > > > system.
> > > >
> > > > I've gone through the nagios configuration file and tried almost
> > > > every combination of tweaks including: aggregate updates,
> > > > aggressive checking, orphaned services,
> > > > inter_check_delay_methods, service_interleave_factors, setting up
> > > > a ramdisk, etc. I've found the *best* settings seem to be the
> > > > "smart" methods, but they are FAR from perfect. Nagios is still
> > > > overrun with the external commands.
> > > >
> > > > I know there have to be people who have successfully implemented
> > > > Nagios in a large distributed environment, and I'm hoping some of
> > > > you might speak up about issues you may have had.
> > > >
> > > > I believe this problem has to do with Nagios, and my guess is
> > > > it's either a performance option available in the nagios.cfg or
> > > > something I have to rewrite/set in the source. I've tried most
> > > > nagios.cfg options available with no luck. I've attached my
> > > > nagios.cfg just in case someone notices a blatant error (I know
> > > > everything here is not the most efficient, it's just what my
> > > > latest "test" used).
> > > >
> > > > Thanks for your time and sorry for the long explanation!
> > > >
> > > > Jason Lancaster
> > > > Intranet Administrator, Affinity Internet
> > > > (954) 334-8203
> > > >
> > > > check_external_commands=1
> > > > command_check_interval=30s
> > > > command_file=/usr/local/nagios/var/rw/nagios.cmd
> > > > comment_file=/usr/local/nagios/var/comment.log
> > > > downtime_file=/usr/local/nagios/var/downtime.log
> > > > lock_file=/usr/local/nagios/var/nagios.lock
> > > > temp_file=/usr/local/nagios/var/nagios.tmp
> > > > log_rotation_method=d
> > > > log_archive_path=/usr/local/nagios/var/archives
> > > > use_syslog=0
> > > > log_notifications=1
> > > > log_service_retries=1
> > > > log_host_retries=1
> > > > log_event_handlers=1
> > > > log_initial_states=1
> > > > log_external_commands=1
> > > > log_passive_service_checks=1
> > > > inter_check_delay_method=n
> > > > service_interleave_factor=1
> > > > max_concurrent_checks=0
> > > > service_reaper_frequency=1
> > > > sleep_time=1
> > > > service_check_timeout=60
> > > > host_check_timeout=30
> > > > event_handler_timeout=30
> > > > notification_timeout=30
> > > > ocsp_timeout=5
> > > > perfdata_timeout=5
> > > > retain_state_information=1
> > > > state_retention_file=/usr/local/nagios/var/status.sav
> > > > retention_update_interval=0
> > > > use_retained_program_state=0
> > > > interval_length=60
> > > > use_agressive_host_checking=0
> > > > execute_service_checks=1
> > > > accept_passive_service_checks=1
> > > > enable_notifications=1
> > > > enable_event_handlers=1
> > > > process_performance_data=0
> > > > obsess_over_services=0
> > > > check_for_orphaned_services=1
> > > > check_service_freshness=1
> > > > freshness_check_interval=600
> > > > aggregate_status_updates=1
> > > > status_update_interval=20
> > > > enable_flap_detection=0
> > > > low_service_flap_threshold=5.0
> > > > high_service_flap_threshold=20.0
> > > > low_host_flap_threshold=5.0
> > > > high_host_flap_threshold=20.0
> > > > date_format=us


