<div>Hi, </div>
<div> </div>
<div>I have reduced the number of incoming passive checks to 3206 every 5 minutes, but I'm still seeing the "Resource temporarily unavailable" messages, and not all results are being processed. Any ideas?</div>

<div> </div>
<div>TIA,</div>
<div>Marc<br><br></div>
<div class="gmail_quote">2008/11/22 Marc Ismael <span dir="ltr">&lt;<a href="mailto:marcismael@gmail.com">marcismael@gmail.com</a>&gt;</span><br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<p>Hello Mailing list,</p>
<p>This issue has been bothering me for quite some time: I'm getting a high number of stale passive check alerts, and it seems that some passive check results are not being processed. I currently have 6596 incoming passive checks every 5 minutes. The rest of the relevant configuration is as follows:</p>

<p>define service{<br>       name                            template_passive<br>       active_checks_enabled           0<br>       passive_checks_enabled          1<br>       parallelize_check               0<br>       obsess_over_service             0<br>
       check_freshness                 1<br>       freshness_threshold             600<br>       check_command                   check_stale_passive<br>       notifications_enabled           1<br>       event_handler_enabled           0<br>
       flap_detection_enabled          1<br>       failure_prediction_enabled      0<br>       process_perf_data               0<br>       retain_status_information       1<br>       retain_nonstatus_information    1<br>       is_volatile                     0<br>
       check_period                    24x7<br>       max_check_attempts              1<br>       normal_check_interval           1<br>       retry_check_interval            1<br>       contact_groups                  admin<br>
       notification_options            c<br>       notification_interval           0<br>       notification_period             24x7<br>       register                        0<br>       }</p>
<p># nagios.cfg<br>max_check_result_reaper_time=15<br>check_result_reaper_frequency=5<br>service_freshness_check_interval=780<br>host_freshness_check_interval=90<br>status_update_interval=20<br>check_external_commands=1<br>
command_check_interval=-1<br>external_command_buffer_slots=8192<br>event_broker_options=-1<br>use_syslog=0<br>log_notifications=1<br>log_service_retries=1<br>log_host_retries=1<br>log_event_handlers=1<br>log_initial_states=0<br>
log_external_commands=1<br>log_passive_checks=1<br>max_service_check_spread=30<br>max_host_check_spread=30<br>max_concurrent_checks=0<br>max_check_result_file_age=3600<br>cached_host_check_horizon=15<br>cached_service_check_horizon=15<br>
enable_predictive_host_dependency_checks=1<br>enable_predictive_service_dependency_checks=1<br>soft_state_dependencies=0<br>auto_reschedule_checks=0<br>auto_rescheduling_interval=30<br>auto_rescheduling_window=180<br>sleep_time=0.125<br>
service_check_timeout=60<br>host_check_timeout=30<br>event_handler_timeout=30<br>notification_timeout=30<br>ocsp_timeout=5<br>perfdata_timeout=5<br>retain_state_information=1<br>retention_update_interval=60<br>use_retained_program_state=0<br>
use_retained_scheduling_info=1<br>retained_host_attribute_mask=0<br>retained_service_attribute_mask=0<br>retained_process_host_attribute_mask=0<br>retained_process_service_attribute_mask=0<br>retained_contact_host_attribute_mask=0<br>
retained_contact_service_attribute_mask=0<br>interval_length=60<br>use_aggressive_host_checking=0<br>execute_service_checks=1<br>accept_passive_service_checks=1<br>execute_host_checks=1<br>accept_passive_host_checks=1<br>
enable_notifications=1<br>enable_event_handlers=1<br>process_performance_data=0<br>obsess_over_services=0<br>obsess_over_hosts=0<br>translate_passive_host_checks=0<br>passive_host_checks_are_soft=0<br>check_for_orphaned_services=1<br>
check_for_orphaned_hosts=1<br>check_service_freshness=1<br>check_host_freshness=1<br>additional_freshness_latency=15<br>enable_flap_detection=1<br>low_service_flap_threshold=5.0<br>high_service_flap_threshold=20.0<br>low_host_flap_threshold=5.0<br>
high_host_flap_threshold=20.0<br>p1_file=/usr/local/nagios/sbin/p1.pl<br>enable_embedded_perl=1<br>use_embedded_perl_implicitly=1<br>use_regexp_matching=1<br>use_true_regexp_matching=0<br>daemon_dumps_core=0<br>use_large_installation_tweaks=1<br>
enable_environment_macros=0<br>free_child_process_memory=0<br>child_processes_fork_twice=0<br>debug_level=0<br>debug_verbosity=1<br>max_debug_file_size=1000000</p>
<p><br>My current situation: Nagios fails to process an average of roughly 600 out of 6596 passive check results every 5 minutes.</p>
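<p>One way to put a number on that gap, since log_external_commands=1 and log_passive_checks=1 are both set, might be to compare the commands Nagios logged as received with the passive results it logged as processed. The sketch below is only an illustration: the two log line shapes in the comments are assumptions about the nagios.log format, not lines taken from this server, and compare_passive_results is a hypothetical helper name.</p>

```python
import re
from collections import Counter

# Assumed nagios.log line shapes (not copied from the server in this post):
#   [ts] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
#   [ts] PASSIVE SERVICE CHECK: host;service;code;output
SUBMITTED = re.compile(
    r"EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;([^;]+);([^;]+);"
)
PROCESSED = re.compile(r"PASSIVE SERVICE CHECK: ([^;]+);([^;]+);")


def compare_passive_results(lines):
    """Count received commands vs. processed results per (host, service).

    Returns (total_received, total_processed, missing), where missing is a
    Counter of (host, service) pairs received more often than processed.
    """
    submitted, processed = Counter(), Counter()
    for line in lines:
        match = SUBMITTED.search(line)
        if match:
            submitted[match.groups()] += 1
            continue
        match = PROCESSED.search(line)
        if match:
            processed[match.groups()] += 1
    # Counter subtraction keeps only the pairs with a positive difference.
    missing = submitted - processed
    return sum(submitted.values()), sum(processed.values()), missing
```

<p>Feeding it the log lines from one 5-minute window would give three figures. If the received count is already below what the senders submitted, the results are probably being lost before Nagios reads them from the command file; if the received count is complete but the processed count is short, the loss is inside Nagios.</p>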
<p>I admit I don't know Nagios that well. I started installing and using it only recently, and I don't know where or how to start troubleshooting this. I did install mrtg and did a good amount of trial and error with the config, especially max_check_result_reaper_time and check_result_reaper_frequency, but increasing or decreasing these values only makes the situation worse.</p>

<p>However, the following pstree and strace output looks like a reasonable starting point:</p>
<p><br>[root@foobar nagios]# pstree -cpG | grep nagios<br>       |-nagios(7943)---{nagios}(7944)</p>
<p>[root@foobar tmp]# strace -s50 -p 7944<br>Process 7944 attached - interrupt to quit<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>
poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>
poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN}], 1, 500)   = 0<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291780] PROCESS_SERVICE_CHECK_RESULT;foopet"..., 4096) = 94<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291780] PROCESS_SERVICE_CHECK_RESULT;foopet"..., 4096) = 92<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;fooaptm"..., 4096) = 94<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;fooaptm"..., 4096) = 92<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;fooapet"..., 4096) = 93<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;fooapet"..., 4096) = 94<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;foopet"..., 4096) = 92<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1<br>read(4, "[1227291781] PROCESS_SERVICE_CHECK_RESULT;fooapet"..., 4096) = 94<br>
read(4, 0x2aaaaaaad000, 4096)           = -1 EAGAIN (Resource temporarily unavailable)<br>poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 500) = 1</p>
<p>[root@foobar tmp]# ls -l /proc/7944/fd<br>total 0<br>lr-x------ 1 root root 64 Nov 21 13:14 0 -> /dev/null<br>l-wx------ 1 root root 64 Nov 21 13:14 1 -> /dev/null<br>l-wx------ 1 root root 64 Nov 21 13:14 2 -> /dev/null<br>
lrwx------ 1 root root 64 Nov 21 13:14 3 -> /var/run/nagios.pid<br>lrwx------ 1 root root 64 Nov 21 13:14 4 -> /var/log/nagios/rw/nagios.cmd</p>
<p>Are the "EAGAIN (Resource temporarily unavailable)" messages normal?<br>If so, what kind of output do I need to produce to confirm or rule out my gut feeling that Nagios is not processing all results?<br>
If not, any suggestions on how to attack the problem?</p>
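<p>For context on the EAGAIN question: in the strace above, every poll() that returns 1 is followed by one successful read() and then a read() that returns EAGAIN. That is the usual pattern when a non-blocking descriptor is drained until it is empty. The minimal sketch below reproduces it with an ordinary pipe standing in for nagios.cmd; the function names and the sample command string are made up for illustration, and it says nothing about how Nagios itself is implemented.</p>

```python
import errno
import os


def drain_nonblocking(fd, bufsize=4096):
    """Read from fd until no more data is available right now.

    Returns (chunks, final_errno). final_errno is the errno of the
    BlockingIOError that ended the loop, or None if EOF was reached.
    """
    chunks = []
    while True:
        try:
            data = os.read(fd, bufsize)
        except BlockingIOError as exc:
            # Nothing left to read at this moment; the writer is still open.
            return chunks, exc.errno
        if not data:
            # Empty read means EOF: every writer has closed its end.
            return chunks, None
        chunks.append(data)


def demo():
    """Write one command to a pipe, then drain it in non-blocking mode."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    # Hypothetical sample command, shaped like the ones in the strace.
    os.write(write_fd, b"[1227291780] PROCESS_SERVICE_CHECK_RESULT;host;svc;0;OK\n")
    chunks, final_errno = drain_nonblocking(read_fd)
    os.close(read_fd)
    os.close(write_fd)
    return chunks, final_errno
```

<p>Calling demo() yields the single chunk that was written, plus errno.EAGAIN from the second read. Seen this way, EAGAIN only says the pipe was empty at that instant; on its own it does not show that a result was dropped.</p>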
<p>Thank you in advance.</p>
<p>Regards,<br>Marc</p>
<p>server specs:</p>
<p>[root@foobar tmp]# cat /etc/*release<br>Red Hat Enterprise Linux Server release 5.1 (Tikanga)<br>[root@foobar tmp]# free -m<br>            total       used       free     shared    buffers     cached<br>Mem:         31905      23681       8224          0        553      15672</p>

<p>8 cpus<br>processor       : 7<br>vendor_id       : AuthenticAMD<br>cpu family      : 15<br>model           : 33<br>model name      : AMD Opteron (tm) Processor 880<br>stepping        : 2<br>cpu MHz         : 2400.000<br>
cache size      : 1024 KB</p>
<p>[root@foobar tmp]# /usr/local/nagios/sbin/nagios -v /etc/nagios/nagios.cfg</p>
<p>Nagios 3.0.3<br>Copyright (c) 1999-2008 Ethan Galstad (<a href="http://www.nagios.org/" target="_blank">http://www.nagios.org</a>)<br>Last Modified: 06-25-2008<br>License: GPL</p>
<p>Reading configuration data...</p>
<p>Running pre-flight check on configuration data...</p>
<p>Checking services...<br>       Checked 7491 services.<br>Checking hosts...<br>       Checked 460 hosts.<br>Checking host groups...<br>       Checked 30 host groups.<br>Checking service groups...<br>       Checked 0 service groups.<br>
Checking contacts...<br>       Checked 3 contacts.<br>Checking contact groups...<br>       Checked 3 contact groups.<br>Checking service escalations...<br>       Checked 0 service escalations.<br>Checking service dependencies...<br>
       Checked 0 service dependencies.<br>Checking host escalations...<br>       Checked 0 host escalations.<br>Checking host dependencies...<br>       Checked 0 host dependencies.<br>Checking commands...<br>       Checked 28 commands.<br>
Checking time periods...<br>       Checked 6 time periods.<br>Checking for circular paths between hosts...<br>Checking for circular host and service dependencies...<br>Checking global event handlers...<br>Checking obsessive compulsive processor commands...<br>
Checking misc settings...</p>
<p>Total Warnings: 0<br>Total Errors:   0</p>
<p>Things look okay - No serious problems were detected during the pre-flight check</p></blockquote></div><br>