huge performance problems

Mieden, Rick van der rick.vandermieden at orangemail.nl
Mon Jun 27 13:35:40 CEST 2005


Thanks for the responses. I tweaked it a bit, but latency is still bad
with 174 hosts and 2360 services (I tuned it down from 540 seconds to
224 seconds). My plugins are fine; they run really fast on the command
line. I have also noticed that the latency drops to 4 seconds if I have
around 1700 services running. So it looks like Nagios has problems when
the number of services goes over 2000 or something like that.

I've read something about USE_MEMORY_PERFORMANCE_TWEAKS, but even that
option does not improve the latency. I have also read that there are
many people who have far more host and service checks than I have
without any performance problems. So I'd love to see their nagios.cfg,
or would like to know what the trick is.

Regards,
Rick

-----Original Message-----
From: Hendrik Baecker [mailto:b00mer at gmx.net] 
Sent: Thursday, June 23, 2005 15:50
To: Mieden, Rick van der
Cc: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] huge performance problems

 

Hi,

One year ago we had nearly the same performance problems.

It seems that the Nagios scheduler trips over itself if the number of
services is too big. Therefore we decided to install another Nagios
process with different configs in a different directory. So we split
our Nagios along the lines of our networks: one Nagios (nagios-1) for
Network A and another one (nagios-2) for Network B.

That reduced the number of services per Nagios instance, and it has run
well so far.
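
A minimal sketch of how such a per-network split could be sanity-checked
before writing the separate configs. Everything in it is hypothetical:
the host names, the network labels, the service counts, and the budget
of 2000 services per instance (the thread only suggests that trouble
starts somewhere around that number).

```python
from collections import defaultdict

# Hypothetical inventory: host name -> (network, number of services).
# These values are placeholders for illustration only.
HOSTS = {
    "web01": ("A", 14),
    "web02": ("A", 12),
    "db01": ("B", 20),
    "db02": ("B", 9),
}

# Assumed per-instance budget, not a documented Nagios limit.
SERVICE_BUDGET = 2000


def services_per_instance(hosts):
    """Sum the service counts of all hosts assigned to each network."""
    totals = defaultdict(int)
    for network, service_count in hosts.values():
        totals[network] += service_count
    return dict(totals)


def over_budget(totals, budget):
    """Return the networks whose service total exceeds the budget."""
    return sorted(net for net, total in totals.items() if total > budget)


totals = services_per_instance(HOSTS)
print(totals)
print(over_budget(totals, SERVICE_BUDGET))
```

Each network then maps to one Nagios instance, and any network listed by
over_budget() is a candidate for a further split.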

All this was under version 1.2.

In the past I posted some questions about our problem, but there was no
good answer, so today I only know that it works for us.

So much for that.
I hope nobody will mind if I use your post to describe some problems we
now have while testing the setup above, with different instances on the
same host, under Nagios 2.02b.

When I fire up my instance "nagios-1" with around 1600 service checks,
it runs fine with nearly no latency.
But when I fire up "nagios-2" with around 1850 services, this instance
very quickly reaches latencies of around 100 seconds.
When I then stop the first instance, the latencies on the second one
drop to < 5 seconds.

Perhaps one of the developers can tell me whether I am right in theory
that (one of) the worker thread(s) handling the scheduling queue can see
the other scheduling queue? Are they possibly the same?

I am not a programmer, but I imagine the following: starting nagios-1
creates the scheduling queue and puts it in RAM. So far so good.
There it is, and the worker runs through it and executes the checks.
I am now afraid that when I start my second Nagios process, it will
also create a scheduling queue in system RAM, but that the two
processes don't have their own queues... I hope somebody understands
what I mean.

Best regards
Hendrik

Mieden, Rick van der wrote:

We have heavy performance problems with Nagios. We monitor 174 hosts
with 2255 services and an average latency of 400 seconds! Of course
that's not acceptable.

I use Perl plugins with ssh and SNMP plugins. I've compiled Nagios with
perlcache and embedded-perl enabled. The server is a SPARC server with
2 x 1.1 GHz CPUs and 1024 MB of RAM (Solaris 8, latest patch level).

I played around with all kinds of parameters and read the tuning docs
for Nagios.

Below is the output of "nagios -s nagios.cfg":

 

Nagios 2.0b3
Copyright (c) 1999-2005 Ethan Galstad (www.nagios.org)
Last Modified: 04-03-2005
License: GPL

Projected scheduling information for host and service
checks is listed below.  This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     174
Total scheduled hosts:           0
Host inter-check delay method:   SMART
Average host check interval:     0.00 sec
Host inter-check delay:          0.00 sec
Max host check spread:           30 min
First scheduled check:           N/A
Last scheduled check:            N/A

SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     2255
Total scheduled services:           2255
Service inter-check delay method:   SMART
Average service check interval:     222.47 sec
Inter-check delay:                  0.10 sec
Interleave factor method:           SMART
Average services per host:          12.96
Service interleave factor:          13
Max service check spread:           30 min
First scheduled check:              Wed Jun 22 15:05:08 2005
Last scheduled check:               Wed Jun 22 15:08:50 2005

CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval:      5 sec
Max concurrent service checks:      200

PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
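
As a rough cross-check of the service scheduling figures above, the
sketch below recomputes the inter-check delay and the check rate the
schedule implies. It assumes the SMART method simply divides the average
check interval by the number of scheduled services; the variable names
are illustrative only.

```python
# Figures copied from the "nagios -s" output above.
total_services = 2255
avg_check_interval = 222.47  # seconds

# Spreading every service evenly over one average interval gives the
# spacing between consecutive check starts.
inter_check_delay = avg_check_interval / total_services

# Sustained rate of check starts needed to keep up with the schedule.
required_rate = total_services / avg_check_interval

print(f"inter-check delay: {inter_check_delay:.2f} sec")      # 0.10 sec
print(f"required rate:     {required_rate:.2f} checks/sec")   # 10.14 checks/sec
```

The recomputed delay matches the 0.10 sec reported above, so the
scheduler would need to start roughly ten checks per second to stay on
schedule.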

 

 

And the nagiostats output:

 

CURRENT STATUS DATA
----------------------------------------------------
Status File:                          /usr/local/nagios/var/status.dat
Status File Age:                      0d 0h 0m 13s
Status File Version:                  2.0b3

Program Running Time:                 0d 32h 0m 13s

Total Services:                       2255
Services Checked:                     2255
Services Scheduled:                   2255
Active Service Checks:                2255
Passive Service Checks:               0
Total Service State Change:           0.000 / 5.860 / 0.003 %
Active Service Latency:               386.526 / 414.446 / 394.100 %
Active Service Execution Time:        0.062 / 60.349 / 1.428 sec
Active Service State Change:          0.000 / 5.860 / 0.003 %
Active Services Last 1/5/15/60 min:   155 / 1044 / 2255 / 2255
Passive Service State Change:         0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:            2242 / 0 / 0 / 13
Services Flapping:                    0
Services In Downtime:                 0

Total Hosts:                          174
Hosts Checked:                        174
Hosts Scheduled:                      0
Active Host Checks:                   174
Passive Host Checks:                  0
Total Host State Change:              0.000 / 0.000 / 0.000 %
Active Host Latency:                  0.000 / 0.000 / 0.000 %
Active Host Execution Time:           0.137 / 1.109 / 0.582 sec
Active Host State Change:             0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:      1 / 2 / 2 / 9
Passive Host State Change:            0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                174 / 0 / 0
Hosts Flapping:                       0
Hosts In Downtime:                    0
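
A back-of-the-envelope sketch comparing the throughput these statistics
show with what the schedule asks for. It is only arithmetic on the
figures quoted in the two outputs above, not a diagnosis, and the
in-flight estimate assumes checks arrive at a steady rate (Little's
law).

```python
# Figures copied from the "nagios -s" and status outputs above.
total_services = 2255
avg_check_interval = 222.47   # seconds, planned
avg_latency = 394.100         # average active service latency
avg_execution_time = 1.428    # seconds per check, average
checks_last_5_min = 1044

# Checks actually completed per second over the last five minutes.
observed_rate = checks_last_5_min / 300

# Time needed to get through every service once at that rate.
implied_cycle = total_services / observed_rate

# Planned interval plus the reported average latency, for comparison.
planned_plus_latency = avg_check_interval + avg_latency

# Average number of checks running at once if the planned rate were met.
avg_in_flight = total_services / avg_check_interval * avg_execution_time

print(f"observed rate:        {observed_rate:.2f} checks/sec")
print(f"implied cycle:        {implied_cycle:.0f} sec")
print(f"planned + latency:    {planned_plus_latency:.2f} sec")
print(f"checks in flight:     {avg_in_flight:.1f}")
```

With these inputs the observed rate is about 3.5 checks per second
against the roughly 10 per second the schedule needs, and the implied
cycle of about 648 seconds is in the same range as the planned interval
plus the reported latency (about 617 seconds). At the planned rate only
about 14 to 15 checks would be running at once on average, far below
the limit of 200 concurrent checks, which suggests that plugin run time
and the concurrency limit alone may not explain the backlog.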

 

 

Does anybody have an idea what is going wrong here? There must be
something...

Regards,
Rick

===========================================================

The information contained in this message may be confidential and is
intended to be only for the addressee. Should you receive this message
unintentionally, please do not use the contents herein and notify the
sender immediately by return e-mail. Although Orange has taken steps to
ensure that this email and attachments are free from any virus, you do
need to verify the possibility of their existence as Orange can take no
responsibility for any computer virus which might be transferred by way
of this email.

===========================================================