distributed monitoring/central server performance problems

Jason Lancaster jlancaster at affinity.com
Thu May 8 20:32:34 CEST 2003


Ethan and list,
I agree that the command check interval I was using may not have been the
most efficient, and I likely would have eventually seen a problem with the
central server parsing these external commands. After making my first post,
I realized my issue has to do with how Nagios manages outgoing OCSP
commands.
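
For reference, the arithmetic behind Ethan's estimate (quoted further down) can be written out as a rough sketch. The 4K pipe size, ~100 bytes per command, and the 30 and 3 second intervals come from his reply; the 3200 services and 5 minute check interval come from the earlier posts in this thread. The function names are only illustrative.

```python
# Back-of-the-envelope check of the external command pipe numbers
# quoted in this thread. All inputs are the approximate figures from
# the thread; the function names are illustrative only.


def commands_per_pipe(pipe_bytes: int, command_bytes: int) -> int:
    """How many whole commands fit in the pipe before it fills up."""
    return pipe_bytes // command_bytes


def drain_rate(pipe_bytes: int, command_bytes: int, interval_s: float) -> float:
    """Upper bound on commands processed per second if the pipe is
    emptied once per command check interval."""
    return commands_per_pipe(pipe_bytes, command_bytes) / interval_s


def arrival_rate(services: int, check_interval_s: float) -> float:
    """Average number of passive results arriving per second."""
    return services / check_interval_s


if __name__ == "__main__":
    print(commands_per_pipe(4096, 100))         # checks per full pipe
    print(round(drain_rate(4096, 100, 30), 2))  # per second, 30s interval
    print(round(drain_rate(4096, 100, 3), 2))   # per second, 3s interval
    print(round(arrival_rate(3200, 300), 2))    # incoming per second
```

With these inputs the pipe drains at roughly 1.3 checks per second at a 30 second interval, against roughly 10.7 arriving, and at roughly 13 per second at a 3 second interval.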

I came to that conclusion after observing the following:
In my "non-distributed" test environment, I have 2683 service checks. I'm
using OCSP with ocsp_timeout=3. This OCSP command does not go to any other
systems; it is just a simple echo, "echo $1 $2 $3 $4 >> ocsp.log". Looking
at the web page and the status.log, things are updating within a 10-15
minute interval. This is the behavior I expect, and it works quite well.

Complicate this echo with a command that takes slightly longer to execute,
by adding a "sleep 3" into the mix, and I start having problems. Service and
host update intervals go from approximately 10 minutes to 15, to 20, to
30... to never getting updated. The system stops executing active checks of
any type, including freshness checks. Nagios becomes useless at this point.

If I comment my sleep line out at this point, Nagios begins to sync back to
its normal 10 minute intervals.
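
One way to see why a 3 second command could have this effect is the rough arithmetic below. It assumes, and this is only an assumption since nobody in this thread has confirmed it from the source, that Nagios runs the ocsp command once per service check result and waits for it to finish before moving on. The 0.01 second cost for the plain echo is a made-up placeholder.

```python
# Rough cost of running one OCSP command per check result, one at a time.
# ASSUMPTION (not confirmed in this thread): OCSP commands are executed
# serially and block other processing while they run.


def serial_ocsp_seconds(service_checks: int, seconds_per_command: float) -> float:
    """Total time spent in OCSP commands for one pass over all services."""
    return service_checks * seconds_per_command


def keeps_up(service_checks: int, seconds_per_command: float,
             target_interval_s: float) -> bool:
    """True if one full pass of OCSP commands fits in the target interval."""
    total = serial_ocsp_seconds(service_checks, seconds_per_command)
    return total <= target_interval_s


if __name__ == "__main__":
    checks = 2683       # service checks in the test environment
    target = 10 * 60    # desired update interval, in seconds

    fast = 0.01         # hypothetical cost of a plain echo
    slow = 3.0          # echo plus "sleep 3"

    print(serial_ocsp_seconds(checks, fast), keeps_up(checks, fast, target))
    print(serial_ocsp_seconds(checks, slow), keeps_up(checks, slow, target))
```

Under that assumption, 2683 checks at 3 seconds each add up to 8049 seconds, a bit over two hours per pass, which cannot fit in a 10 minute interval.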

I don't know where the problem lies; it very well could be the way I have
Nagios configured. Personally, I theorize this is due to how Nagios decides
to manage its ocsp commands: perhaps if one ocsp command takes a long time
to execute, Nagios thinks that everything needs more execution time. I don't
know much about C and I don't know the source well, but I'm more than
willing to work with anyone who wants more information on this issue.

I've pretty much given up on handling any advanced ocsp methods within
Nagios and made Nagios execute the ocsp command as quickly as possible, using
a simple bash echo script that writes to a fifo. I have to keep in mind the
important factors Ethan discussed in his last reply, so I'm sending
consolidated nsca results at 10 second intervals. This interval could be
lowered so that the consolidated NSCA parser sends each service result
through NSCA individually (just like the default ocsp behavior of Nagios in
a distributed environment). This may in fact work. I have yet to test it,
but I think I will soon.
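
The fifo consolidation described above could look roughly like the sketch below. It is not the actual script from this setup. It assumes the ocsp script writes one tab-separated result per line (host, service, return code, plugin output) to the fifo, and that send_nsca forwards every line it receives on stdin over a single connection. The Batcher class, the fifo path, the host name, and the config path are all hypothetical.

```python
import subprocess
import time
from typing import List


class Batcher:
    """Collects result lines and reports when a batch is due."""

    def __init__(self, interval_s: float, start: float) -> None:
        self.interval_s = interval_s
        self.last_flush = start
        self.lines: List[str] = []

    def add(self, line: str) -> None:
        line = line.rstrip("\n")
        if line:  # ignore empty lines
            self.lines.append(line)

    def due(self, now: float) -> bool:
        """True if results are pending and the interval has elapsed."""
        return bool(self.lines) and now - self.last_flush >= self.interval_s

    def flush(self, now: float) -> str:
        """Return the pending lines as one newline-terminated payload."""
        payload = "".join(item + "\n" for item in self.lines)
        self.lines = []
        self.last_flush = now
        return payload


def send_batch(payload: str, central_host: str, config_path: str) -> None:
    # One send_nsca process (one connection) for the whole batch.
    subprocess.run(["send_nsca", "-H", central_host, "-c", config_path],
                   input=payload.encode(), check=True)


def run(fifo_path: str, central_host: str, config_path: str,
        interval_s: float = 10.0) -> None:
    batcher = Batcher(interval_s, time.monotonic())
    while True:
        # open() blocks until a writer appears; the inner loop ends at
        # EOF when the last writer closes, so the fifo is reopened.
        with open(fifo_path) as fifo:
            for line in fifo:
                batcher.add(line)
                now = time.monotonic()
                if batcher.due(now):
                    send_batch(batcher.flush(now), central_host, config_path)


# Example call (hypothetical paths and host name):
# run("/usr/local/nagios/var/rw/ocsp.fifo", "central.example.com",
#     "/usr/local/nagios/etc/send_nsca.cfg")
```

The sketch only checks the timer when a new line arrives, so the last results before a quiet period wait until the next result shows up. Lowering interval_s toward zero approaches sending each result on its own.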

Thanks,
Jason

----- Original Message ----- 
From: "Ethan Galstad" <nagios at nagios.org>
To: <nagios-users at lists.sourceforge.net>
Sent: Tuesday, May 06, 2003 19:24
Subject: RE: [Nagios-users] distributed monitoring/central server performance problems


> NSCA may be to blame (consolidated transmits would help), but it is
> more likely that you are experiencing a bottleneck with the external
> command file.  This file is implemented as a named pipe, which (under
> Linux) has a size of 4K.  If one external command for a passive check
> is ~100 bytes, that means you can fit about 40 passive checks into the
> pipe before it fills up.  Your config snippet indicates that you are
> checking for external commands every 30 seconds.  That's way too long
> of an interval - Nagios will only process ~1.5 passive checks per
> second at that rate (you've got ~10 per second incoming).  Try
> setting the command check interval to 3 seconds (or -1) and see if
> that helps.
>
> Nagios 2.0 should be able to handle this much better than 1.0, as
> I've written in a dedicated thread that continuously reads from the
> command file and buffers the input for later handling.  By default,
> this should allow you to handle ~512+ passive checks per command
> check interval.
>
>
> On 6 May 2003 at 1:13, Jason Lancaster wrote:
>
> > Sg,
> > Thanks for the input on this... I believe what I've run into is a
> > bottleneck with send_nsca and the general ocsp command. 3200 services
> > all using separate send_nsca commands on an average check_interval of
> > 5 minutes makes for a very crazy system; almost 10 incoming/outgoing
> > nsca connections a second.
> >
> > I've tested this theory by disabling the ocsp command and using
> > simpler commands (such as an echo to a log file) with success.
> >
> > I'm going to write an nsca "sweeper" tomorrow and see how sending
> > increments of the ocsp commands (50 at a time, 100 at a time, etc)
> > through a single nsca connection to the central server are processed.
> >
> > I'll keep posting and be sure to let everyone know if I solve it.
> >
> > -Jason
> >
> > -----Original Message-----
> > From: nagios-users-admin at lists.sourceforge.net
> > [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of
> > Subhendu Ghosh
> > Sent: Monday, May 05, 2003 17:26
> > To: nagios-users at lists.sourceforge.net
> > Subject: Re: [Nagios-users] distributed monitoring/central server
> > performance problems
> >
> > Is there any way for you to run a debugging version of Nagios (DEBUG3)
> > and/or strace?
> >
> > -sg
> >
> > On Mon, 5 May 2003, Jason Lancaster wrote:
> >
> > > Hi everyone,
> > > I have somewhat of an update on this situation. I've been able to
> > > get similar results in a non-distributed environment. It may or may
> > > not give anyone ideas, but it does simplify the situation.
> > >
> > > Current server: Dual P3 1.2 GHz, 1 GB RAM, Red Hat 7.3, Nagios 1.0
> > > Approx 2600 services, 313 hosts. 2324 passive checks sent through
> > > nsca and written to the external commands file.
> > > 359 active service checks.
> > > Uses cfg file posted previously.
> > >
> > > Host/Service status does not update regularly. All services are
> > > set up to update within 5 minute periods, but there are results still
> > > pending for checks made over 45 minutes ago. Stale checks are
> > > attempted once, but then Nagios becomes so bogged down that it can't
> > > process anything. This is reflected in the web interface as well as
> > > the status.log.
> > >
> > > If anyone uses Nagios in a similar large-scale environment, I'd
> > > really appreciate some input.
> > >
> > > Thanks,
> > > Jason
> > >
> > > ----- Original Message ----- 
> > > From: "Jason Lancaster" <jason at skynetweb.com>
> > > To: <nagios-users at lists.sourceforge.net>
> > > Sent: Friday, May 02, 2003 19:54
> > > Subject: [Nagios-users] distributed monitoring/central server
> > > performance problems
> > >
> > >
> > > > A simple background of my environment:
> > > > My central server is receiving external commands from 3 monitoring
> > > > servers. I have just over 3200 services monitored, all delivered to
> > > > the central server through NSCA. Everything works perfectly until
> > > > the Nagios process on the central server attempts to parse the
> > > > external commands.
> > > >
> > > > When first started, Nagios updates status information (alerts)
> > > > quickly, but as time goes on, status updates (alerts) are parsed
> > > > slower and slower until eventually nothing happens and only external
> > > > commands are written. This cripples Nagios, and since it is not
> > > > executing local alerts or status updates, it never executes stale
> > > > checks or sends out notifications. I'm left with a webpage that
> > > > displays results anywhere from 6 hours ago to about 15 minutes ago.
> > > > The odd thing about this is that the behavior is completely
> > > > unpredictable, although it sometimes seems like it gives
> > > > alphabetical priority to the first few letters of the alphabet.
> > > >
> > > > If the above confuses you, perhaps a snip from the log might help:
> > > > [1051917099] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain.com;PING;0;PING OK - Packet loss = 0%, RTA = 0.80 ms
> > > > **repeat external command lines hundreds of times, with the
> > > > following line happening about 20-30 minutes after the external
> > > > command**
> > > > [1051917099] SERVICE ALERT: hostname.domain.com;PING;OK;HARD;1;OK - Packet loss = 0%, RTA = 0.80 ms
> > > >
> > > > The central server is far from being overworked, with a load average
> > > > of 0.04 and both CPUs averaging about 96% idle. I can in no way
> > > > attribute this behavior to the hardware on my central system.
> > > >
> > > > I've gone through the Nagios configuration file and tried almost
> > > > every combination of tweaks including: aggregate updates, aggressive
> > > > checking, orphaned services, inter_check_delay_methods,
> > > > service_interleave_factors, setting up a ramdisk, etc. I've found
> > > > the *best* settings seem to be the "smart" methods, but they are FAR
> > > > from perfect. Nagios is still overrun with the external commands.
> > > >
> > > > I know there have to be people who have successfully implemented
> > > > Nagios in a large distributed environment and I'm hoping some of you
> > > > might speak up about issues you may have had.
> > > >
> > > > I believe this problem has to do with Nagios, and my guess is it's
> > > > either a performance option available in the nagios.cfg or something
> > > > I have to rewrite/set in the source. I've tried most nagios.cfg
> > > > options available with no luck. I've attached my nagios.cfg just in
> > > > case someone notices a blatant error (I know everything here is not
> > > > the most efficient, it's just what my latest "test" used).
> > > >
> > > > Thanks for your time and sorry for the long explanation!
> > > >
> > > > Jason Lancaster
> > > > Intranet Administrator, Affinity Internet
> > > > (954) 334-8203
> > > >
> > > > check_external_commands=1
> > > > command_check_interval=30s
> > > > command_file=/usr/local/nagios/var/rw/nagios.cmd
> > > > comment_file=/usr/local/nagios/var/comment.log
> > > > downtime_file=/usr/local/nagios/var/downtime.log
> > > > lock_file=/usr/local/nagios/var/nagios.lock
> > > > temp_file=/usr/local/nagios/var/nagios.tmp
> > > > log_rotation_method=d
> > > > log_archive_path=/usr/local/nagios/var/archives
> > > > use_syslog=0
> > > > log_notifications=1
> > > > log_service_retries=1
> > > > log_host_retries=1
> > > > log_event_handlers=1
> > > > log_initial_states=1
> > > > log_external_commands=1
> > > > log_passive_service_checks=1
> > > > inter_check_delay_method=n
> > > > service_interleave_factor=1
> > > > max_concurrent_checks=0
> > > > service_reaper_frequency=1
> > > > sleep_time=1
> > > > service_check_timeout=60
> > > > host_check_timeout=30
> > > > event_handler_timeout=30
> > > > notification_timeout=30
> > > > ocsp_timeout=5
> > > > perfdata_timeout=5
> > > > retain_state_information=1
> > > > state_retention_file=/usr/local/nagios/var/status.sav
> > > > retention_update_interval=0
> > > > use_retained_program_state=0
> > > > interval_length=60
> > > > use_agressive_host_checking=0
> > > > execute_service_checks=1
> > > > accept_passive_service_checks=1
> > > > enable_notifications=1
> > > > enable_event_handlers=1
> > > > process_performance_data=0
> > > > obsess_over_services=0
> > > > check_for_orphaned_services=1
> > > > check_service_freshness=1
> > > > freshness_check_interval=600
> > > > aggregate_status_updates=1
> > > > status_update_interval=20
> > > > enable_flap_detection=0
> > > > low_service_flap_threshold=5.0
> > > > high_service_flap_threshold=20.0
> > > > low_host_flap_threshold=5.0
> > > > high_host_flap_threshold=20.0
> > > > date_format=us
> >
> >
> >
> > -------------------------------------------------------
> > This sf.net email is sponsored by:ThinkGeek
> > Welcome to geek heaven.
> > http://thinkgeek.com/sf
> > _______________________________________________
> > Nagios-users mailing list
> > Nagios-users at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-users
> > ::: Please include Nagios version, plugin version (-v) and OS when
> > reporting any issue. ::: Messages without supporting info will risk
> > being sent to /dev/null
> >
>
>
>
> Ethan Galstad,
> Nagios Developer
> ---
> Email: nagios at nagios.org
> Website: http://www.nagios.org
>
>
>
> -------------------------------------------------------
> Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara
> The only event dedicated to issues related to Linux enterprise solutions
> www.enterpriselinuxforum.com
>


