Large scale installation

Andreas Brandino ampranti at gmail.com
Fri Jun 15 15:57:59 CEST 2012


Thank you for the reply.

My server has an Intel(R) Xeon(TM) CPU at 2.80GHz (1 core) and 3 GB of RAM.
I have 1300 checks, 320 hosts and mk_livestatus. CPU load is about 55-60%.
Also, one client is always connected, loading NagVis maps and the state of
specific checks (refresh rate is 30 secs).

Checks are performed at various intervals (ranging from 1 minute to 10
minutes).
All plugins are written in Perl; I think converting them to compiled C would
take a lot of effort.
use_large_installation_tweaks is already enabled
(use_large_installation_tweaks=1).

I am not using MySQL; the whole configuration is text-based.

Here is the report from nagiostats:

Nagios Stats 3.4.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 05-11-2012
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/log/nagios/status.dat
Status File Age:                        0d 0h 0m 6s
Status File Version:                    3.4.1

Program Running Time:                   0d 2h 24m 37s
Nagios PID:                             11485
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         1342
Services Checked:                       1342
Services Scheduled:                     1341
Services Actively Checked:              1341
Services Passively Checked:             1
Total Service State Change:             0.000 / 39.540 / 0.076 %
Active Service Latency:                 0.005 / 0.717 / 0.203 sec
Active Service Execution Time:          0.013 / 20.340 / 2.241 sec
Active Service State Change:            0.000 / 11.580 / 0.047 %
Active Services Last 1/5/15/60 min:     199 / 1002 / 1294 / 1328
Passive Service Latency:                34.021 / 34.021 / 34.021 sec
Passive Service State Change:           39.540 / 39.540 / 39.540 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              1278 / 19 / 18 / 27
Services Flapping:                      1
Services In Downtime:                   0

Total Hosts:                            318
Hosts Checked:                          318
Hosts Scheduled:                        318
Hosts Actively Checked:                 318
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.017 / 0.446 / 0.195 sec
Active Host Execution Time:             0.019 / 30.026 / 5.615 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        43 / 285 / 318 / 318
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  298 / 18 / 2
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     51 / 319 / 986
   Scheduled:                           46 / 294 / 915
   On-demand:                           5 / 25 / 71
   Parallel:                            46 / 295 / 921
   Serial:                              0 / 0 / 0
   Cached:                              5 / 24 / 66
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  238 / 1048 / 3138
   Scheduled:                           238 / 1048 / 3138
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I think the above statistics are OK; I want to use a second server (and
move active checks to it) to keep the load under 60% (or even lower than 40%)
as the number of checks increases.
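
To keep an eye on these figures as checks are added, here is a minimal
sketch (not something from this thread) that pulls the average out of a
"min / max / avg" line of the nagiostats report. The helper name stat_avg is
hypothetical, and the sample lines are copied from the report above.

```shell
#!/bin/sh
# stat_avg LABEL: read nagiostats text on stdin and print the third ("avg")
# field of the "min / max / avg" line that starts with LABEL.
# stat_avg is a hypothetical helper name, not part of Nagios.
stat_avg() {
    awk -v label="$1" '
        index($0, label ":") == 1 {
            sub(/^[^:]*:[ \t]*/, "")   # drop "Label:" and the padding
            split($0, f, " / ")        # f[1]=min, f[2]=max, f[3]=avg
            split(f[3], v, " ")        # strip the trailing unit (sec, %)
            print v[1]
            exit
        }'
}

# Sample lines taken from the report above.
report='Active Service Latency:                 0.005 / 0.717 / 0.203 sec
Active Service Execution Time:          0.013 / 20.340 / 2.241 sec
Active Host Latency:                    0.017 / 0.446 / 0.195 sec'

printf '%s\n' "$report" | stat_avg 'Active Service Latency'
printf '%s\n' "$report" | stat_avg 'Active Host Latency'
```

On a live system the same function could read the output of nagiostats
directly instead of the sample text.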

Thank you

2012/6/12 Giorgio Zarrelli <zarrelli at linux.it>

> Hi,
>
> You are right, open files is a major concern that I forgot to mention. A
> quick and dirty way to solve it is to raise the open-files limit by putting
> a ulimit command followed by a high value in the Nagios startup script.
>
> ulimit -a will show the limits currently in effect (for the shell you run
> it in).
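
A minimal sketch of that approach, assuming the startup script is a POSIX
shell script. The helper name raise_nofile and the value 8192 are
illustrative assumptions, not recommendations from this thread. The request
is capped at the hard limit, because the soft limit can never exceed it.

```shell
#!/bin/sh
# raise_nofile WANT: raise the soft open-files limit of this shell (and of
# every process it starts afterwards) to WANT, capped at the hard limit.
# raise_nofile is a hypothetical helper name.
raise_nofile() {
    want=$1
    soft=$(ulimit -S -n)
    hard=$(ulimit -H -n)
    # Nothing to do if the soft limit is already high enough.
    if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$want" ]; then
        return 0
    fi
    # The soft limit can never exceed the hard limit, so cap the request.
    if [ "$hard" != "unlimited" ] && [ "$want" -gt "$hard" ]; then
        want=$hard
    fi
    ulimit -S -n "$want"
}

raise_nofile 8192      # 8192 is an arbitrary example value
ulimit -S -n           # print the soft limit now in effect
```

In a startup script the call would go before the line that launches the
daemon, since limits are inherited by child processes.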
>
> Lucky you, SSD disks are a good improvement!
>
> Ciao,
>
> Giorgio
>
> On 12 Jun 2012, at 03:59, Ian Orszaczki <ian at griggle.net> wrote:
>
>
> Great advice.  Funny you should mention status.dat in ramdisk as we have
> hit a hiccup this morning which has meant we have lost comments and
> downtimes.
>
> We had moved status.dat to a ramdisk as recommended for large
> installations (we are monitoring 3390 hosts with 18748 services from one
> server, latencies below 2 secs and load under 2) but after running out of
> open files the status.dat was zeroed.
>
>
> As an extreme hack I ran a quick script across the output of:
> # grep EXTERNAL nagios.log | grep ACK | cut -c57- > /tmp/acks.txt
>
> Script:
>  #!/bin/sh
>  # This is a sample shell script showing how you can submit the
>  # ACKNOWLEDGE_HOST_PROBLEM command to Nagios.
>  # Adjust variables to fit your environment as necessary.
>  now=$(date +%s)
>  commandfile='/app/nagios/var/rw/nagios.cmd'
>  # read -r keeps backslashes intact; the line is passed as a printf
>  # argument (not inside the format string) so a '%' in it stays literal.
>  while read -r line
>  do
>          echo "$line"
>          /usr/bin/printf '[%lu] %s\n' "$now" "$line" > "$commandfile"
>  done < /tmp/acks.txt
>
> Therefore I am going to move status.dat back onto the local disk (luckily
> SSD drives) so that we can at least restore from a recent backup. I will
> probably also create a valid copy, along with retention.dat, every hour to
> enable quick recovery. And yes, I have increased the process and open files
> limits for the nagios user.
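
A minimal sketch of such an hourly copy, assuming POSIX sh. The helper name
snapshot_state and all paths are hypothetical. Empty files are skipped, so a
zeroed status.dat never ends up among the backups.

```shell
#!/bin/sh
# snapshot_state DEST FILE...: copy each non-empty FILE into DEST under a
# name that carries the current hour, e.g. status.dat.2012061215.
# snapshot_state and the paths used below are hypothetical examples.
snapshot_state() {
    dest=$1; shift
    stamp=$(date +%Y%m%d%H)
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        # -s is true only for files that exist and are larger than zero
        # bytes, so a truncated status.dat is never copied.
        if [ -s "$f" ]; then
            cp -p "$f" "$dest/$(basename "$f").$stamp" || return 1
        fi
    done
}

# Demo on throwaway files; in real use pass the actual status.dat and
# retention.dat paths and run the script hourly from cron.
tmp=$(mktemp -d)
echo 'hoststatus {' > "$tmp/status.dat"
: > "$tmp/retention.dat"                 # simulate a zeroed file
snapshot_state "$tmp/backup" "$tmp/status.dat" "$tmp/retention.dat"
ls "$tmp/backup"
```

Because the file name only carries the hour, a second run within the same
hour overwrites that hour's copy rather than piling up files.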
>
> Am I missing anything obvious?
>
>
> On Tue, Jun 12, 2012 at 5:40 AM, Giorgio Zarrelli <zarrelli at linux.it> wrote:
>
>> Hi,
>>
>> I suggest reviewing your installation. Try the large installation
>> tweaks: http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
>>
>> Then, check whether you really need all your checks every 5 minutes or
>> whether some of them can move to a 10-minute interval.
>>
>> Then, review your check plugins: Perl plugins eat more memory and CPU
>> cycles than compiled C checks. If they support EPN
>> (http://nagios.sourceforge.net/docs/3_0/embeddedperl.html), use it: it
>> makes your plugins faster and lighter.
>>
>> Then, check your checks. Some checks return data more slowly than others;
>> SNMP checks, for example, are not lightning fast.
>>
>> Then, check your graphs. Graphing perfdata takes CPU cycles and uses
>> memory. Do you need all your graphs?
>>
>> Then, get rid of NDOUtils. They are choking all the way: inefficient,
>> clumsy, old and heavy. If you want to store your data in MySQL, use Merlin
>> instead.
>>
>> Anyway, did you tune your MySQL? Is it causing too much I/O? Is it
>> munching too much RAM or CPU cycles?
>>
>> Did you tune your Apache or http server? Does it cope with your needs? Is
>> it munching too much RAM or CPU cycles?
>>
>> If you want live information about your hosts and services, say for use
>> with NagVis, grab mk_livestatus: it's blazing fast and gives you access to
>> the core Nagios process.
>>
>> Are you using a virtualized environment? If so, remember that the I/O layer
>> in virtualized environments performs poorly; use fast, real disks and your
>> I/O wait will drop dramatically.
>>
>> Try moving status.dat to /dev/shm. The latter is a ready-to-use RAM disk,
>> and writing to RAM is always faster than writing to disk.
>>
>> Avoid logging too much, it increases I/O and takes CPU and RAM.
>>
>> What are iotop and iostat telling you?
>>
>> What do you see in top or htop?
>>
>> If you can or wish, compile everything from source; it will run faster on
>> your system.
>>
>> You can use passive checks with NSCA or NRDP to reduce load, even though I
>> do not like them a lot.
>>
>> These are just a few ideas that came to my mind.
>>
>>
>> Let's talk about sharing load.
>>
>> You can use different methods:
>>
>> Merlin
>> (http://www.op5.org/community/plugin-inventory/op5-projects/merlin):
>> gives you load balancing and redundancy. I use it for Ninja, but have never
>> used it for load balancing and redundancy.
>>
>> DNX (http://dnx.sourceforge.net/): Something new, it's gaining momentum,
>> good for offloading checks. Worth a try.
>>
>> Mod_gearman (http://labs.consol.de/lang/de/nagios/mod-gearman/): Love at
>> first sight :-) Easy, powerful, load balancing and fault tolerant. Compile
>> gearmand with memcached support and all the check results will go directly
>> to RAM, avoiding disk I/O. It's really simple to set up; if one of the
>> workers goes down, the others will share its work. Be careful: security is a
>> problem, there is no good auth system, but using a VPN will solve the
>> problem. It is efficient: I use a virtual machine with 2 cores and 2 GB of
>> RAM to run about 5K checks, and the load is not a concern. Need more
>> horsepower? Add a worker. Have some checks timing out due to poor
>> connections to the targets? Put a worker close to the target, but be
>> careful: the timing, say the RTA of a ping, will be from the worker's
>> perspective.
>>
>> Well, hope it helps.
>>
>>
>> ------------------------------------------------------------------------------
>> Live Security Virtual Conference
>> Exclusive live event will cover all the ways today's security and
>> threat landscape has changed and how IT managers can respond. Discussions
>> will include endpoint security, mobile security and the latest in malware
>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when
>> reporting any issue.
>> ::: Messages without supporting info will risk being sent to /dev/null
>>