huge performance problems

Hendrik Baecker b00mer at gmx.net
Thu Jun 23 15:49:47 CEST 2005


Hi,

A year ago we had nearly the same performance problems.

It seems that the Nagios scheduler trips over itself when the number of
services gets too big. We therefore decided to run a second Nagios
process with its own configs in a different directory, and split our
Nagios setup along the lines of our networks: one instance (nagios-1)
for Network A and another (nagios-2) for Network B.

That reduced the number of services per Nagios instance, and so far it
has run well.
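
When two instances run from separate directories on one host, each needs its own set of writable files. The sketch below is one way to check that: it compares two main config files and reports file directives that point at the same path. The directive list is an assumption based on the Nagios 2.x main config file, and the paths and config snippets are hypothetical, not taken from our setup.

```python
# Directives that name files a running instance writes to.
# Assumption: this list follows the Nagios 2.x main config; adjust for your version.
WRITABLE_FILE_DIRECTIVES = (
    "log_file", "status_file", "command_file", "lock_file",
    "temp_file", "state_retention_file", "object_cache_file",
    "comment_file", "downtime_file",
)


def parse_main_config(text):
    """Parse 'key=value' lines from nagios.cfg-style text, skipping comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def shared_files(config_a, config_b):
    """Return {directive: path} for writable files both configs point at."""
    a = parse_main_config(config_a)
    b = parse_main_config(config_b)
    return {
        name: a[name]
        for name in WRITABLE_FILE_DIRECTIVES
        if name in a and a[name] == b.get(name)
    }


# Hypothetical configs: the command file was copied without being changed.
NAGIOS_1_CFG = """
# instance for Network A
log_file=/usr/local/nagios-1/var/nagios.log
status_file=/usr/local/nagios-1/var/status.dat
command_file=/usr/local/nagios/var/rw/nagios.cmd
"""

NAGIOS_2_CFG = """
# instance for Network B
log_file=/usr/local/nagios-2/var/nagios.log
status_file=/usr/local/nagios-2/var/status.dat
command_file=/usr/local/nagios/var/rw/nagios.cmd
"""

if __name__ == "__main__":
    for name, path in shared_files(NAGIOS_1_CFG, NAGIOS_2_CFG).items():
        print(f"both instances use {name}={path}")
```

In practice the two strings would be read from the real nagios.cfg of each instance.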

All this was under version 1.2.

I have posted questions about our problem here before, but never got a
good answer, so all I know today is that this setup works for us.

So much for that.
I hope nobody minds if I use your post to describe some problems we are
now seeing while testing the same approach, with separate instances on
the same host, under nagios 2.02b.

When I start my instance "nagios-1" with around 1600 service checks, it
runs fine with almost no latency.
But when I then start "nagios-2" with around 1850 services, that
instance quickly climbs to latencies of around 100 seconds.
If I now stop the first instance, the latency on the second one drops
to under 5 seconds.

Perhaps one of the developers can tell me whether I am right in theory
that (one of) the worker thread(s) handling the scheduling queue can see
the other instance's scheduling queue. Are they possibly the same queue?

I am not a programmer, but I imagine the following: starting nagios-1
creates the scheduling queue and places it in RAM. So far so good.
There it sits, and the worker runs through it and executes the checks.
My fear is that when I start my second Nagios process, it also creates
its scheduling queue in system RAM, but that the two processes do not
each get their own queue... I hope somebody understands what I mean.
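
As a general point about Unix processes (and not a statement about the Nagios source), ordinary in-memory data is private to each process unless the program explicitly sets up shared memory. The minimal sketch below shows this with a plain Python list standing in for a scheduling queue. Even a forked child, which starts as a copy of its parent, changes only its own copy. The function name and the list contents are made up for illustration, and `os.fork` is only available on Unix-like systems.

```python
import os


def queue_lengths_after_fork():
    """Fork a child that empties its copy of a queue.

    Returns (length seen by the child, length seen by the parent).
    """
    queue = ["check-1", "check-2", "check-3"]  # stand-in for a scheduling queue
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child process: modify the queue, report its length, then exit.
        os.close(read_fd)
        queue.clear()
        os.write(write_fd, str(len(queue)).encode())
        os._exit(0)

    # Parent process: read the child's report and wait for it to finish.
    os.close(write_fd)
    child_length = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_length, len(queue)


if __name__ == "__main__":
    child_len, parent_len = queue_lengths_after_fork()
    print(f"child sees {child_len} entries, parent still sees {parent_len}")
```

Two Nagios instances started independently are separated in the same way, so if they slow each other down, shared machine resources or shared files are a more likely place to look than a shared queue. That is a general expectation, not something measured here.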

Best regards
Hendrik

Mieden, Rick van der wrote:

> We have heavy performance problems with Nagios. We monitor 174 hosts
> with 2255 services and see an average latency of 400 seconds! Of
> course that's not acceptable.
>
> I use Perl plugins with ssh and SNMP plugins. I've compiled Nagios with
> perlcache and embedded-perl enabled. The server is a SPARC server with
> 2 x 1.1 GHz CPUs and 1024 MB RAM (Solaris 8, latest patch level).
>
> I played around with all kinds of parameters and read the tuning docs
> for Nagios.
>
> Below is the output of "nagios -s nagios.cfg":
>
> Nagios 2.0b3
> Copyright (c) 1999-2005 Ethan Galstad (www.nagios.org)
> Last Modified: 04-03-2005
> License: GPL
>
> Projected scheduling information for host and service
> checks is listed below.  This information assumes that
> you are going to start running Nagios with your current
> config files.
>
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts:                     174
> Total scheduled hosts:           0
> Host inter-check delay method:   SMART
> Average host check interval:     0.00 sec
> Host inter-check delay:          0.00 sec
> Max host check spread:           30 min
> First scheduled check:           N/A
> Last scheduled check:            N/A
>
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services:                     2255
> Total scheduled services:           2255
> Service inter-check delay method:   SMART
> Average service check interval:     222.47 sec
> Inter-check delay:                  0.10 sec
> Interleave factor method:           SMART
> Average services per host:          12.96
> Service interleave factor:          13
> Max service check spread:           30 min
> First scheduled check:              Wed Jun 22 15:05:08 2005
> Last scheduled check:               Wed Jun 22 15:08:50 2005
>
> CHECK PROCESSING INFORMATION
> ----------------------------
> Service check reaper interval:      5 sec
> Max concurrent service checks:      200
>
> PERFORMANCE SUGGESTIONS
> -----------------------
> I have no suggestions - things look okay.
>
> And a nagiostat output:
>
> CURRENT STATUS DATA
> ----------------------------------------------------
> Status File:                          /usr/local/nagios/var/status.dat
> Status File Age:                      0d 0h 0m 13s
> Status File Version:                  2.0b3
>
> Program Running Time:                 0d 32h 0m 13s
>
> Total Services:                       2255
> Services Checked:                     2255
> Services Scheduled:                   2255
> Active Service Checks:                2255
> Passive Service Checks:               0
> Total Service State Change:           0.000 / 5.860 / 0.003 %
> *Active Service Latency:               386.526 / 414.446 / 394.100 %*
> Active Service Execution Time:        0.062 / 60.349 / 1.428 sec
> Active Service State Change:          0.000 / 5.860 / 0.003 %
> *Active Services Last 1/5/15/60 min:   155 / 1044 / 2255 / 2255*
> Passive Service State Change:         0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit:            2242 / 0 / 0 / 13
> Services Flapping:                    0
> Services In Downtime:                 0
>
> Total Hosts:                          174
> Hosts Checked:                        174
> Hosts Scheduled:                      0
> Active Host Checks:                   174
> Passive Host Checks:                  0
> Total Host State Change:              0.000 / 0.000 / 0.000 %
> Active Host Latency:                  0.000 / 0.000 / 0.000 %
> Active Host Execution Time:           0.137 / 1.109 / 0.582 sec
> Active Host State Change:             0.000 / 0.000 / 0.000 %
> Active Hosts Last 1/5/15/60 min:      1 / 2 / 2 / 9
> Passive Host State Change:            0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
> Hosts Up/Down/Unreach:                174 / 0 / 0
> Hosts Flapping:                       0
> Hosts In Downtime:                    0
>
> Does anybody have an idea what went wrong here? There must be something...
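
One rough way to read the quoted numbers is to compare the check rate the schedule asks for with the rate actually achieved. The sketch below takes 2255 services and the 222.47 sec average check interval from the scheduling output, and the "Last 1 min" count of 155 from the status output. Treating that count as completed checks per minute is an assumption, and the function and variable names are made up for illustration.

```python
def required_checks_per_minute(total_services, avg_check_interval_sec):
    """Checks per minute needed to check every service once per average interval."""
    return total_services / avg_check_interval_sec * 60.0


def achieved_fraction(observed_per_minute, required_per_minute):
    """Fraction of the required check rate that is actually being achieved."""
    return observed_per_minute / required_per_minute


# Figures taken from the quoted output.
TOTAL_SERVICES = 2255
AVG_CHECK_INTERVAL_SEC = 222.47
CHECKED_LAST_1_MIN = 155

required = required_checks_per_minute(TOTAL_SERVICES, AVG_CHECK_INTERVAL_SEC)
fraction = achieved_fraction(CHECKED_LAST_1_MIN, required)

if __name__ == "__main__":
    print(f"required: about {required:.0f} checks per minute")
    print(f"observed: {CHECKED_LAST_1_MIN} checks in the last minute")
    print(f"achieved: about {fraction:.0%} of the required rate")
```

On these figures the schedule calls for roughly 608 checks per minute while only 155 were seen in the last minute, about a quarter of the target. A scheduler that far behind would show the kind of latency reported, but the figures alone do not say what is holding the checks up.
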
>
> Regards,
> Rick
