Nagios and Gearman - huge environment performance problem

Rodney Ramos rodneyra at gmail.com
Tue Aug 23 22:21:42 CEST 2011


Hi, everybody. Sorry for taking so long to reply, but I was testing what was
suggested.

Well, I put all the files (status.dat, checkresults, nagios.tmp, nagios.log, etc.)
on a RAM disk (/dev/shm). I also disabled all broker modules, leaving only
the mod_gearman broker, of course. I disabled flap detection,
performance data processing, everything.

The result: absolutely nothing. No improvement. Nagios still sits at 100%
CPU. Latency is still high, between 250 and 500 sec.

I've also tested the parameters "max_concurrent_checks",
"check_result_reaper_frequency" and "max_check_result_reaper_time".

When I changed max_concurrent_checks from "0" to "200", the CPU usage of the
nagios process fell to 30-50%. However, the latency increased a lot, going to
more than 1000 sec!!

I also changed "check_result_reaper_frequency" and
"max_check_result_reaper_time": the first from 10 to 5 sec, the second from 30
to 15 sec. No big difference.

I enabled the Nagios debug log too. I had to increase the debug file size as
it fills up very, very fast. You can see some lines below.

The conclusion: I think that Nagios is not able to run active checks for so
many hosts and services. It is a limitation of the tool. It has to do so
much processing, like scheduling and rescheduling, that all the active checks
get delayed. And it is not Gearman's fault. On the contrary, Gearman and
mod_gearman do their jobs very well.
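
As a rough back-of-the-envelope check of that conclusion, the sketch below takes the timestamps of the six consecutive "Handling check result" entries in the debug excerpt further down and estimates how many results per second the Nagios process gets through. It is only an estimate under stated assumptions: six samples are a tiny window, and debug logging itself slows Nagios down, so the real rate without debugging may be higher.

```python
# Fractional-second timestamps of the six consecutive
# "Handling check result" entries in the debug excerpt (all within
# second 1314129294).
handling_ts = [0.337652, 0.352679, 0.368049, 0.383339, 0.398337, 0.413452]

# Average wall-clock time between two handled results.
per_result = (handling_ts[-1] - handling_ts[0]) / (len(handling_ts) - 1)

total_checks = 30000 + 60000   # hosts + services, from the original post
results_per_sec = 1 / per_result
full_cycle_sec = total_checks * per_result

print(f"{per_result * 1000:.1f} ms per result")
print(f"{results_per_sec:.0f} results per second")
print(f"{full_cycle_sec:.0f} s to handle one result for every object")
```

At roughly 15 ms per result, one pass over all 90000 objects would take well over 1000 seconds, which is in the same range as the latencies reported above. The check intervals are not given in this thread, so this does not say how far behind the scheduler really is.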

But, as Daniel said, there is one thing that I can't understand. Why is my
CPU 87% idle? It's very weird. Is there something that would improve the
performance? A Nagios or operating system parameter?
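
One possible explanation for the idle figure, sketched as plain arithmetic: as far as I know, the "Cpu(s)" summary line in top is averaged over all CPUs by default, while the per-process %CPU column is relative to a single CPU. The numbers come from the top output and hardware description quoted later in this message. The assumption that the nagios process is effectively limited to one core is mine.

```python
n_cpus = 8                       # "8 CPUs" from the original post
nagios_pct_of_one_cpu = 100.1    # %CPU of the nagios process in the top output

# Share of the whole machine used by one process that saturates one core.
machine_wide_pct = nagios_pct_of_one_cpu / n_cpus

print(f"nagios alone: {machine_wide_pct:.1f}% of total CPU")
print(f"idle if nothing else ran: {100 - machine_wide_pct:.1f}%")
```

That is close to the 12.5%us and 87.1%id that top reports, so the machine can look mostly idle while the one core running Nagios is saturated.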

Thank you very much.

===================
Debug output:
===================
[1314129294.322456] [032.0] [pid=31793] ** Service Notification Attempt **
Host: '139874', Service: 'Memoria', Type: 0, Options: 0, Current State: 2,
Last Notification: Wed Dec 31 21:00:00 1969
[1314129294.322461] [001.0] [pid=31793]
check_service_notification_viability()
[1314129294.322464] [001.0] [pid=31793] check_time_against_period()
[1314129294.322469] [032.1] [pid=31793] Notifications are temporarily
disabled for this service, so we won't send one out.
[1314129294.322473] [032.0] [pid=31793] Notification viability test failed.
No notification will be sent out.
[1314129294.322477] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:07:56 2011
[1314129294.322481] [001.0] [pid=31793] get_next_valid_time()
[1314129294.322484] [001.0] [pid=31793] check_time_against_period()
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'Memoria' on host 'mi139874' @ Tue Aug 23 17:07:56 2011
[1314129294.337171] [001.0] [pid=31793] reschedule_event()
[1314129294.337193] [001.0] [pid=31793] add_event()
[1314129294.337590] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.337598] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.337605] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.337610] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.337630] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.337652] [016.1] [pid=31793] Handling check result for service
'Memoria' on host '167077'...
[1314129294.337656] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.337659] [016.0] [pid=31793] ** Handling check result for service
'Memoria' on host 'mi167077'...
[1314129294.337662] [016.1] [pid=31793] HOST: mi167077, SERVICE: Memoria,
CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK:
Yes, RETURN CODE: 0, OUTPUT: OK: physical memory: Total: 3.49G - Used: 914M
(25%) - Free: 2.6G (75%)|'physical memory'=25%;90;95; \n
[1314129294.337693] [016.1] [pid=31793] Service is OK.
[1314129294.337697] [016.1] [pid=31793] Service did not change state.
[1314129294.337707] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:08:06 2011
[1314129294.337710] [001.0] [pid=31793] get_next_valid_time()
[1314129294.337714] [001.0] [pid=31793] check_time_against_period()
[1314129294.337724] [001.0] [pid=31793] schedule_service_check()
[1314129294.337728] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'Memoria' on host '167077' @ Tue Aug 23 17:08:06 2011
[1314129294.352397] [001.0] [pid=31793] reschedule_event()
[1314129294.352418] [001.0] [pid=31793] add_event()
[1314129294.352603] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.352610] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.352616] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.352622] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.352625] [001.0] [pid=31793] check_for_service_flapping()
[1314129294.352629] [016.1] [pid=31793] Checking service 'Memoria' on host
'167077' for flapping...
[1314129294.352633] [001.0] [pid=31793] check_for_host_flapping()
[1314129294.352637] [016.1] [pid=31793] Checking host '167077' for
flapping...
[1314129294.352658] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.352679] [016.1] [pid=31793] Handling check result for service
'CPU' on host 'mi139447'...
[1314129294.352683] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.352686] [016.0] [pid=31793] ** Handling check result for service
'CPU' on host '139447'...
[1314129294.352689] [016.1] [pid=31793] HOST: 139447, SERVICE: CPU, CHECK
TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes,
RETURN CODE: 2, OUTPUT: CHECK_NRPE: Socket timeout after 10 seconds.\n
[1314129294.352702] [016.1] [pid=31793] Service is in a non-OK state!
[1314129294.352706] [016.1] [pid=31793] Host is currently DOWN/UNREACHABLE.
[1314129294.352709] [016.1] [pid=31793] Assuming host is in same state as
before...
[1314129294.352720] [032.0] [pid=31793] ** Host Notification Attempt **
Host: '139447', Type: 0, Options: 0, Current State: 1, Last Notification:
Wed Dec 31 21:00:00 1969
[1314129294.352725] [001.0] [pid=31793] check_host_notification_viability()
[1314129294.352728] [001.0] [pid=31793] check_time_against_period()
[1314129294.352738] [032.1] [pid=31793] Notifications are temporarily
disabled for this host, so we won't send one out.
[1314129294.352742] [032.0] [pid=31793] Notification viability test failed.
No notification will be sent out.
[1314129294.352745] [016.1] [pid=31793] Current/Max Attempt(s): 1/4
[1314129294.352748] [016.1] [pid=31793] Host isn't UP, so we won't retry the
service check...
[1314129294.352762] [001.0] [pid=31793] process_macros()
[1314129294.352766] [2048.1] [pid=31793] **** BEGIN MACRO PROCESSING
***********
[1314129294.352769] [2048.1] [pid=31793] Processing: 'SERVICE ALERT:
mi139447;CPU;$SERVICESTATE$;$SERVICESTATETYPE$;$SERVICEATTEMPT$;CHECK_NRPE:
Socket timeout after 10 seconds.
'
[1314129294.352781] [2048.1] [pid=31793]   Done.  Final output: 'SERVICE
ALERT: mi139447;CPU;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10
seconds.
'
[1314129294.352785] [2048.1] [pid=31793] **** END MACRO PROCESSING
*************
[1314129294.352831] [064.1] [pid=31793] Making callbacks (type 9)...
[1314129294.352838] [001.0] [pid=31793] handle_service_event()
[1314129294.352841] [064.1] [pid=31793] Making callbacks (type 30)...
[1314129294.352848] [001.0] [pid=31793] run_global_service_event_handler()
[1314129294.352852] [001.0] [pid=31793] check_for_external_commands()
[1314129294.352858] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:07:56 2011
[1314129294.352862] [001.0] [pid=31793] get_next_valid_time()
[1314129294.352865] [001.0] [pid=31793] check_time_against_period()
[1314129294.352871] [001.0] [pid=31793] schedule_service_check()
[1314129294.352876] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'CPU' on host '139447' @ Tue Aug 23 17:07:56 2011
[1314129294.367552] [001.0] [pid=31793] reschedule_event()
[1314129294.367576] [001.0] [pid=31793] add_event()
[1314129294.367972] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.367979] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.367984] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.367990] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.367993] [001.0] [pid=31793] check_for_service_flapping()
[1314129294.367997] [016.1] [pid=31793] Checking service 'CPU' on host
'139447' for flapping...
[1314129294.368001] [001.0] [pid=31793] check_for_host_flapping()
[1314129294.368005] [016.1] [pid=31793] Checking host '139447' for
flapping...
[1314129294.368027] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.368049] [016.1] [pid=31793] Handling check result for service
'CPU' on host '139496'...
[1314129294.368053] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.368057] [016.0] [pid=31793] ** Handling check result for service
'CPU' on host '139496'...
[1314129294.368060] [016.1] [pid=31793] HOST: 139496, SERVICE: CPU, CHECK
TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes,
RETURN CODE: 2, OUTPUT: CHECK_NRPE: Socket timeout after 10 seconds.\n
[1314129294.368075] [016.1] [pid=31793] Service is in a non-OK state!
[1314129294.368079] [016.1] [pid=31793] Host is currently DOWN/UNREACHABLE.
[1314129294.368082] [016.1] [pid=31793] Assuming host is in same state as
before...
[1314129294.368094] [032.0] [pid=31793] ** Host Notification Attempt **
Host: 'mi139496', Type: 0, Options: 0, Current State: 1, Last Notification:
Wed Dec 31 21:00:00 1969
[1314129294.368098] [001.0] [pid=31793] check_host_notification_viability()
[1314129294.368101] [001.0] [pid=31793] check_time_against_period()
[1314129294.368111] [032.1] [pid=31793] Notifications are temporarily
disabled for this host, so we won't send one out.
[1314129294.368115] [032.0] [pid=31793] Notification viability test failed.
No notification will be sent out.
[1314129294.368118] [016.1] [pid=31793] Current/Max Attempt(s): 4/4
[1314129294.368122] [016.1] [pid=31793] Service has reached max number of
rechecks, so we'll handle the error...
[1314129294.368125] [001.0] [pid=31793] check_for_service_flapping()
[1314129294.368128] [016.1] [pid=31793] Checking service 'CPU' on host
'139496' for flapping...
[1314129294.368132] [001.0] [pid=31793] check_for_host_flapping()
[1314129294.368135] [016.1] [pid=31793] Checking host '139496' for
flapping...
[1314129294.368138] [001.0] [pid=31793] service_notification()
[1314129294.368144] [032.0] [pid=31793] ** Service Notification Attempt **
Host: '139496', Service: 'CPU', Type: 0, Options: 0, Current State: 2, Last
Notification: Wed Dec 31 21:00:00 1969
[1314129294.368148] [001.0] [pid=31793]
check_service_notification_viability()
[1314129294.368151] [001.0] [pid=31793] check_time_against_period()
[1314129294.368156] [032.1] [pid=31793] Notifications are temporarily
disabled for this service, so we won't send one out.
[1314129294.368160] [032.0] [pid=31793] Notification viability test failed.
No notification will be sent out.
[1314129294.368165] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:07:56 2011
[1314129294.368168] [001.0] [pid=31793] get_next_valid_time()
[1314129294.368171] [001.0] [pid=31793] check_time_against_period()
[1314129294.368176] [001.0] [pid=31793] schedule_service_check()
[1314129294.368181] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'CPU' on host 'mi139496' @ Tue Aug 23 17:07:56 2011
[1314129294.382852] [001.0] [pid=31793] reschedule_event()
[1314129294.382875] [001.0] [pid=31793] add_event()
[1314129294.383268] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.383275] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.383281] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.383286] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.383320] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.383339] [016.1] [pid=31793] Handling check result for service
'Memoria' on host '167028'...
[1314129294.383343] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.383346] [016.0] [pid=31793] ** Handling check result for service
'Memoria' on host '167028'...
[1314129294.383350] [016.1] [pid=31793] HOST: 167028, SERVICE: Memoria,
CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK:
Yes, RETURN CODE: 0, OUTPUT: OK: physical memory: Total: 3.49G - Used: 856M
(23%) - Free: 2.65G (77%)|'physical memory'=23%;90;95; \n
[1314129294.383366] [016.1] [pid=31793] Service is OK.
[1314129294.383370] [016.1] [pid=31793] Service did not change state.
[1314129294.383380] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:08:06 2011
[1314129294.383383] [001.0] [pid=31793] get_next_valid_time()
[1314129294.383386] [001.0] [pid=31793] check_time_against_period()
[1314129294.383396] [001.0] [pid=31793] schedule_service_check()
[1314129294.383401] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'Memoria' on host 'mi167028' @ Tue Aug 23 17:08:06 2011
[1314129294.398073] [001.0] [pid=31793] reschedule_event()
[1314129294.398096] [001.0] [pid=31793] add_event()
[1314129294.398268] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.398275] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.398281] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.398287] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.398290] [001.0] [pid=31793] check_for_service_flapping()
[1314129294.398293] [016.1] [pid=31793] Checking service 'Memoria' on host
'mi167028' for flapping...
[1314129294.398298] [001.0] [pid=31793] check_for_host_flapping()
[1314129294.398301] [016.1] [pid=31793] Checking host '167028' for
flapping...
[1314129294.398322] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.398337] [016.1] [pid=31793] Handling check result for service
'CPU' on host '166384'...
[1314129294.398341] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.398345] [016.0] [pid=31793] ** Handling check result for service
'CPU' on host '166384'...
[1314129294.398348] [016.1] [pid=31793] HOST: 166384, SERVICE: CPU, CHECK
TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes,
RETURN CODE: 0, OUTPUT: OK: 15m: average load 2%|'15m'=2%;90;95; \n
[1314129294.398363] [016.1] [pid=31793] Service is OK.
[1314129294.398366] [016.1] [pid=31793] Service did not change state.
[1314129294.398376] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:08:06 2011
[1314129294.398379] [001.0] [pid=31793] get_next_valid_time()
[1314129294.398383] [001.0] [pid=31793] check_time_against_period()
[1314129294.398393] [001.0] [pid=31793] schedule_service_check()
[1314129294.398398] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'CPU' on host '166384' @ Tue Aug 23 17:08:06 2011
[1314129294.413177] [001.0] [pid=31793] reschedule_event()
[1314129294.413202] [001.0] [pid=31793] add_event()
[1314129294.413373] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.413380] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.413387] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.413394] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.413397] [001.0] [pid=31793] check_for_service_flapping()
[1314129294.413400] [016.1] [pid=31793] Checking service 'CPU' on host
'166384' for flapping...
[1314129294.413405] [001.0] [pid=31793] check_for_host_flapping()
[1314129294.413409] [016.1] [pid=31793] Checking host '166384' for
flapping...
[1314129294.413432] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.413452] [016.1] [pid=31793] Handling check result for service
'CPU' on host '167022'...
[1314129294.413455] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.413459] [016.0] [pid=31793] ** Handling check result for service
'CPU' on host '167022'...
[1314129294.413476] [016.1] [pid=31793] HOST: 167022, SERVICE: CPU, CHECK
TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes,
RETURN CODE: 0, OUTPUT: OK: 15m: average load 1%|'15m'=1%;90;95; \n
[1314129294.413493] [016.1] [pid=31793] Service is OK.
[1314129294.413497] [016.1] [pid=31793] Service did not change state.
[1314129294.413506] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:08:06 2011
[1314129294.413510] [001.0] [pid=31793] get_next_valid_time()
[1314129294.413514] [001.0] [pid=31793] check_time_against_period()
[1314129294.413523] [001.0] [pid=31793] schedule_service_check()
[1314129294.413528] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'CPU' on host i167022' @ Tue Aug 23 17:08:06 2011
=================================================
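
To see where the time goes in a debug log like the one above, a small script can list the largest gaps between consecutive entries. The sketch below is only an illustration: the sample lines are copied from the excerpt, and the helper name `largest_gaps` is made up for this example. `Decimal` is used so the microsecond timestamps are subtracted exactly.

```python
import re
from decimal import Decimal

# A few entries copied from the debug output above (first line of each entry).
SAMPLE = """\
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active
[1314129294.337171] [001.0] [pid=31793] reschedule_event()
[1314129294.337193] [001.0] [pid=31793] add_event()
[1314129294.337590] [064.1] [pid=31793] Making callbacks (type 8)...
"""

# [timestamp] [level.verbosity] [pid=N] message
ENTRY = re.compile(r"^\[(\d+\.\d+)\] \[[\d.]+\] \[pid=\d+\]\s*(.*)$")


def largest_gaps(text, top=3):
    """Return (gap_seconds, previous_message, next_message), biggest gap first."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:  # wrapped continuation lines do not match and are skipped
            entries.append((Decimal(m.group(1)), m.group(2)))
    gaps = [
        (b_ts - a_ts, a_msg, b_msg)
        for (a_ts, a_msg), (b_ts, b_msg) in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


for gap, before, after in largest_gaps(SAMPLE):
    print(f"{gap} s between {before!r} and {after!r}")
```

In the excerpt above, every handled result shows a pause of about 14.7 ms between the "Scheduling a non-forced, active check" entry and the following reschedule_event(), while all other steps take microseconds. That would be consistent with the rescheduling step dominating the cost per result, although the log alone does not show what happens inside that step.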

On Mon, Aug 22, 2011 at 7:23 PM, Daniel Wittenberg <
daniel.wittenberg.r0ko at statefarm.com> wrote:

>  What is interesting is your CPU is 87% idle, which indicates to me that
> it’s waiting for something, or not scheduling the checks correctly.  Have
> you tried running in debug mode to see if that indicates anything?  Also
> running in debug on just about any of the plugins can cause this too, just
> in case you have logging turned up on things like nsca, nrpe, pnp4nagios,
> etc.
>
> Dan
>
> *From:* Rodney Ramos [mailto:rodneyra at gmail.com]
> *Sent:* Friday, August 19, 2011 4:44 PM
>
> *To:* Nagios Developers List
> *Subject:* Re: [Nagios-devel] Nagios and Gearman - huge environment
> performance problem
>
>
> Thanks, Daniel, but I don't think that my problem is hardware. I created
> the ramdisk and the problem is the same:
>  - nagios eats 100% of CPU all the time;
>  - nagios does not distribute the active checks smoothly. It waits
> a long time and then runs the active checks in bursts. I can see this with
> gearman_top. The gearmand job waiting queue is empty almost all the
> time, but sometimes there is a burst of jobs in the queue. I can't
> understand this behavior.
>
> Any help would be great. Thanks everybody.
>
> =========
> Top result
> =========
>
> top - 18:40:59 up 106 days, 16:56,  4 users,  load average: 8.52, 6.09, 5.42
> Tasks: 215 total,   2 running, 213 sleeping,   0 stopped,   0 zombie
> Cpu(s): 12.5%us,  0.1%sy,  0.0%ni, 87.1%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   4916356k total,  1974976k used,  2941380k free,   163240k buffers
> Swap:  4194296k total,    22092k used,  4172204k free,   745100k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2189 nagios  25   0  492m 255m 1668 R 100.1  5.3  66:54.59 nagios
> 24658 nagios  15   0  561m 116m  676 S  0.7  2.4  62:00.96 gearmand
>
> On Fri, Aug 19, 2011 at 1:31 PM, Daniel Wittenberg <
> daniel.wittenberg.r0ko at statefarm.com> wrote:
>
> Well but look at your bi and bo, and then the wa column.  So looks like you
> have some IO Wait which probably means it’s waiting on disk activity to get
> things done, and lots of writing to disk.  Have you looked at adding a
> ramdisk for your checkresults, status.dat, and temp_file?  That should help
> eliminate most of the heavy disk i/o from the nagios perspective.  Since it
> doesn’t look like you are swapping memory you should be able to throw some
> at a ramdisk.  You can probably start with 64MB and watch it, might have to
> go higher depending on your workload.
>
> Dan
>
>
> *From:* Rodney Ramos [mailto:rodneyra at gmail.com]
> *Sent:* Friday, August 19, 2011 11:27 AM
> *To:* Nagios Developers List
> *Subject:* Re: [Nagios-devel] Nagios and Gearman - huge environment
> performance problem
>
>
> Hi, Daniel,
>
> As we can see below, I think it is not a hardware problem. The idle CPU is
> between 60 and 80%, which is very good.
>
> Thank you very much.
>
>
> $ vmstat 5
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  1  2  22092 3046788 189640 890940    0    0   295  1053    0    0  4  3 83 10  0
>  1  2  22092 3032992 189664 904600    0    0  2733  7550 3498 7477 12  1 69 18  0
>  1  2  22092 3018240 189668 918632    0    0  2720  4070 2484 5114 13  1 72 15  0
>  1  0  22092 3008312 189668 930336    0    0  2332  1534 1932 3825 13  1 73 14  0
>  1 18  22092 2979292 189724 945780    0    0  1486 13974 2460 8446 16  2 72 10  0
>  1  2  22092 2965244 189736 959228    0    0  2570  9094 3290 7204 13  1 67 19  0
>  1  2  22092 2949064 189748 973100    0    0  2820  3040 2798 6639 13  2 68 17  0
>  1  6  22092 2936060 189768 987788    0    0  2894  3620 2474 5443 13  1 70 16  0
>  1  1  22092 2923320 189780 999708    0    0  2377  2618 2285 4794 13  1 70 16  0
>  1  0  22092 2923428 189780 999964    0    0     0  4575 1732 2317 12  1 86  1  0
>  1  9  22092 2912192 189784 1005260    0    0   402  4544 1541 3889 14  1 82  3  0
>  1  7  22092 2891692 189808 1023020    0    0  2534 13969 3232 9421 14  2 66 17  0
>  3  2  22092 2868908 189836 1037064    0    0  2797  4115 3002 7055 30  2 54 14  0
>  2  2  22092 2860712 189860 1050376    0    0  2646  3352 2448 5416 16  1 67 17  0
>  1  8  22092 2847052 189872 1064036    0    0  2748  3970 2616 5487 13  1 69 17  0
>  1  0  22092 3469576 189876 462624    0    0   825  1245 1379 2098 12  1 83  5  0
>  1  0  22092 3469248 189884 462720    0    0     4  2631 1552 2599 13  0 86  0  0
>  1 20  22092 3449816 189904 482192    0    0  2404  8454 2293 7764 15  2 70 12  0
>  1 17  22092 3434856 189912 495636    0    0  2694  8955 3542 8039 13  2 65 19  0
>  2  7  22092 3422204 189932 509376    0    0  2742  4059 2685 5826 13  1 68 19  0
>  1 13  22092 3407532 189948 522508    0    0  2661  3613 6447 49867 12  4 66 17  0
>  0  0  22092 3404484 189968 525964    0    0   669  3338 5317 43602 10  4 81  6  0
>  1  0  22092 3402004 189984 525956    0    0     0    14 3637 12700 13  1 85  0  0
>  1  0  22092 3398172 190012 526036    0    0     0  3318 3972 12401 14  1 85  0  0
>  2  0  22092 3392628 190028 526048    0    0     0  9331 5347 16423 15  3 81  1  0
>  4  0  22092 3391704 190048 526060    0    0     0  4270 5785 18736 16  2 80  1  0
>  1  1  22092 3391652 190064 526056    0    0     0  4091 4746 14669 16  2 82  1  0
>  1  0  22092 3392104 190068 526056    0    0     0  1562 4037 11849 16  1 83  0  0
>  3  0  22092 3392304 190084 526168    0    0     1  2532 4618 16418 15  2 83  0  0
>  1  7  22092 3386028 190112 531488    0    0   967   363 4194 14941 15  2 77  6  0
>
> On Fri, Aug 19, 2011 at 11:32 AM, Daniel Wittenberg <
> daniel.wittenberg.r0ko at statefarm.com> wrote:
> >
> > One simple thing that might help is just run vmstat for a couple minutes:
> >
> >
> >
> > vmstat 5
> >
> >
> >
> > That can help show if you are hitting some bottlenecks.  Are you using a
> > lot of macros in your configs?
> >
> >
> >
> > Dan
> >
> >
> >
> > From: Rodney Ramos [mailto:rodneyra at gmail.com]
> > Sent: Friday, August 19, 2011 9:30 AM
> > To: Nagios Developers List
> > Subject: [Nagios-devel] Nagios and Gearman - huge environment
> > performance problem
> >
> >
> >
> > Hi everybody,
> >
> > I'm testing Nagios and Gearman / Mod_Gearman. I'd like to replace NSCA
> > with this new approach, as it seems easier to configure and has a lot of
> > advantages. Besides, NSCA and the Nagios freshness mechanism have some
> > problems.
> >
> > Gearman and mod_gearman are working well. I have 30000 hosts and 60000
> > services, and the number is growing!
> >
> > Now I'm having a problem with Nagios performance: it eats 100% of CPU and
> > the host and service latency is very high, around 300 seconds. I think that
> > this is a Nagios problem, as gearman_top shows the job waiting queue empty
> > almost all the time. It seems that Nagios does not send the active checks
> > continuously; once in a while it sends a burst of active checks.
> >
> > I have a physical central server, running RHEL, with 4 GB of RAM,
> > Intel(R) Xeon(R) CPU E5504  @ 2.00GHz (8 CPUs). For the workers I have 9
> > virtual servers running RHEL too.
> >
> > I've already set the Nagios parameters for a large environment, as
> > recommended in the documentation, but it made no difference. Thanks.
> >
> > Nagios parameters for a large environment:
> >
> > - use_large_installation_tweaks=1
> >
> > - enable_environment_macros=0
> >
> > - max_concurrent_checks=0
> >
> > - check_result_reaper_frequency=10
> >
> > Could someone help me? How can I improve Nagios performance to make
> > active checks faster?
> >
> > Thank you very much.
> >
> >
> >
> ------------------------------------------------------------------------------
> > Get a FREE DOWNLOAD! and learn more about uberSVN rich system,
> > user administration capabilities and model configuration. Take
> > the hassle out of deploying and managing Subversion and the
> > tools developers use with it. http://p.sf.net/sfu/wandisco-d2d-2
> > _______________________________________________
> > Nagios-devel mailing list
> > Nagios-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-devel