Bug report/workaround -- (was Re: Nagios Performance Data shows checks aren't being completed)

Eli Stair estair at ilm.com
Thu Dec 15 20:35:31 CET 2005


I've been trying to resolve this situation for over a week now without 
making drastic changes.  Setup: Nagios 2.0b6, all retention data created 
new (not continued from older versions), x86_64, perl cache enabled.

I've had a worsening problem recently on my monitoring host (which 
is controlling 1003 hosts/8543 services/5257 service dependencies): an 
increasing number of service checks and event handlers were falling 
through the scheduler.  Even after stopping and starting nagios, and 
issuing forced_host_svc_checks for the relevant checks/responses during 
the several-minute execution pause, these were being skipped or not 
acted upon.  The status shown in 'view config' confirmed that everything 
was set up properly, but events were missed and either not rescheduled 
or rescheduled but not executed.
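
For readers who want to reproduce the forced checks mentioned above, here 
is a minimal sketch of submitting them through the Nagios external command 
file.  It assumes the standard SCHEDULE_FORCED_HOST_SVC_CHECKS external 
command; the command file path is taken from the nagios.cfg quoted further 
down in this thread and may differ on other installs, and the helper names 
and host name are hypothetical.

```python
import time


def forced_host_svc_checks_command(host_name, when=None):
    """Format one SCHEDULE_FORCED_HOST_SVC_CHECKS external command line.

    Layout: [entry_time] COMMAND;host_name;check_time
    """
    now = int(time.time()) if when is None else int(when)
    return "[%d] SCHEDULE_FORCED_HOST_SVC_CHECKS;%s;%d\n" % (
        now, host_name, now)


def submit(command_line, command_file="/var/log/nagios/rw/nagios.cmd"):
    """Write a command line to the command file (a named pipe).

    The default path is an assumption.  Opening the pipe blocks until
    Nagios has it open for reading, so call this only while Nagios runs.
    """
    with open(command_file, "w") as fifo:
        fifo.write(command_line)


# Hypothetical usage:
#   submit(forced_host_svc_checks_command("web01"))
```

Nagios needs check_external_commands=1 for the command file to be read at 
all.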

The last step I took was to stop nagios a final time last night and zero 
the state file retention.dat (as well as the objects.cache for good 
measure, though it wasn't the problem).  After starting nagios fresh 
with no notion of previous states, within one hour (my threshold for 
service/host checks) the entire schedule was executed properly, all 
services that had been in an unhandled 'bad' state for days were 
checked, and the respective event handlers were run and the situation 
rectified.
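
The workaround above can be sketched as a small helper that keeps a 
timestamped copy of the state file before zeroing it, so the old data is 
still around for later analysis.  This is only an illustration: the 
function name and the paths in the usage comment are hypothetical, and it 
assumes Nagios has already been stopped before it runs.

```python
import os
import shutil
import time


def backup_and_zero(path, backup_dir):
    """Copy `path` into `backup_dir` with a timestamp suffix, then
    truncate the original to zero bytes.  Returns the backup path."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = os.path.join(backup_dir, os.path.basename(path) + "." + stamp)
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    # Truncating in place (instead of deleting) keeps the file's
    # ownership and permissions for the nagios user.
    with open(path, "w"):
        pass
    return backup


# Hypothetical usage, with Nagios stopped:
#   backup_and_zero("/var/log/nagios/retention.dat", "/root/nagios-backup")
```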

I have no idea what causes this or whether it will happen again.  I'll 
be more than happy to provide more details.  I have backups of the 
config and retention files from several points during this period.

I'd really like to help resolve this, as losing the trending data is not 
something I want to do again.  My only concern with this setup is the 
following warning, which I get when building the 2.0 betas on any system 
I have available:

"Warning: Size of service_message struct (528 bytes) is > 
POSIX-guaranteed atomic write size (512 bytes).  Service checks results 
may get lost or mangled!"

I haven't seen this addressed or resolved in any of the archive searches 
I've done.
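
As a rough way to judge whether that build warning matters on a given 
box, the sketch below compares the 528-byte struct size against both the 
POSIX minimum (512 bytes) and the limit the local system actually 
reports.  It assumes the warning is about writes to a pipe, where writes 
of up to PIPE_BUF bytes are atomic; the function and constant names are 
made up for illustration.

```python
import os

POSIX_PIPE_BUF_MIN = 512    # minimum PIPE_BUF that POSIX guarantees
SERVICE_MESSAGE_SIZE = 528  # struct size reported by the build warning


def fits_atomic_write(message_size, pipe_buf):
    """True if a single write of message_size bytes is within the
    atomic-write limit pipe_buf."""
    return message_size <= pipe_buf


def local_pipe_buf():
    """Ask the running system for PIPE_BUF on a freshly created pipe."""
    read_fd, write_fd = os.pipe()
    try:
        return os.fpathconf(write_fd, "PC_PIPE_BUF")
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    limit = local_pipe_buf()
    print("POSIX minimum ok:", fits_atomic_write(SERVICE_MESSAGE_SIZE,
                                                 POSIX_PIPE_BUF_MIN))
    print("local PIPE_BUF:", limit)
    print("local limit ok:", fits_atomic_write(SERVICE_MESSAGE_SIZE, limit))
```

If the local limit turns out to be larger than 528 bytes, the warning is 
probably not the cause of the lost checks on that system, though this 
check alone does not prove it.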

Cheers,

/eli



Eli Stair wrote:
> 
> Corroboration here, I actually have a mail I'm compiling also on the 
> same issue.  2.0b6
> 
> I've got orphaned service checks enabled, unlimited parallel service 
> checks, etc.  If I force a host/svc check through the CGI's or the 
> command file direct they get executed right away... the scheduler just 
> is losing them.
> 
> /eli
> 
> sheeri kritzer wrote:
> 
>> Hi all,
>>
>> My nagios 2.0 installation shows the following under performance
>> information.  There are 99 service checks, and I can't imagine it
>> takes more than an hour to complete all 99.  We've had problems where
>> nagios hasn't found and notified us of problems.  The load on the box
>> is tiny.  nagios -s has no suggestions.  What did I do wrong?
>>
>>  uptime
>>  17:38:38 up 81 days,  9:05,  4 users,  load average: 0.00, 0.00, 0.00
>>
>> Nagios is running, and has been for a while:
>>
>> ps -ef | grep nagios
>> nagios   11160     1  0 Nov14 ?        00:12:32 /usr/bin/nagios -d
>> /etc/nagios/nagios.cfg
>> nagios   22947     1  0 Nov20 ?        00:00:00 nrpe -c 
>> /etc/nagios/nrpe.cfg -d
>>
>> Performance Info:
>>
>> Program-Wide Performance Information
>> Active Service Checks:
>>     
>> Time Frame    Checks Completed
>> <= 1 minute:    1 (1.0%)
>> <= 5 minutes:    58 (58.6%)
>> <= 15 minutes:    60 (60.6%)
>> <= 1 hour:    60 (60.6%)
>> Since program start:      99 (100.0%)
>>     
>> Metric    Min.    Max.    Average
>> Check Execution Time:      0.01 sec    8.71 sec    1.286 sec
>> Check Latency:    0.01 sec    1.03 sec    0.488 sec
>> Percent State Change:    0.00%    0.00%    0.00%
>> Passive Service Checks:
>>     
>> Time Frame    Checks Completed
>> <= 1 minute:    0 (0.0%)
>> <= 5 minutes:    0 (0.0%)
>> <= 15 minutes:    0 (0.0%)
>> <= 1 hour:    0 (0.0%)
>> Since program start:      0 (0.0%)
>>     
>> Metric    Min.    Max.    Average
>> Percent State Change:      0.00%    0.00%    0.00%
>> Active Host Checks:
>>     
>> Time Frame    Checks Completed
>> <= 1 minute:    0 (0.0%)
>> <= 5 minutes:    0 (0.0%)
>> <= 15 minutes:    0 (0.0%)
>> <= 1 hour:    0 (0.0%)
>> Since program start:      19 (76.0%)
>>     
>> Metric    Min.    Max.    Average
>> Check Execution Time:      3.01 sec    4.01 sec    3.972 sec
>> Check Latency:    0.00 sec    0.00 sec    0.000 sec
>> Percent State Change:    0.00%    0.00%    0.00%
>> Passive Host Checks:
>>     
>> Time Frame    Checks Completed
>> <= 1 minute:    0 (0.0%)
>> <= 5 minutes:    0 (0.0%)
>> <= 15 minutes:    0 (0.0%)
>> <= 1 hour:    0 (0.0%)
>> Since program start:      0 (0.0%)
>>     
>> Metric    Min.    Max.    Average
>> Percent State Change:      0.00%    0.00%    0.00%
>>
>> ---------------------------------------------------------------------------------------------------------------------------- 
>>
>>
>> Nagios 2.0b4
>> Copyright (c) 1999-2005 Ethan Galstad (http://www.nagios.org)
>> Last Modified: 08-02-2005
>> License: GPL
>>
>> Projected scheduling information for host and service
>> checks is listed below.  This information assumes that
>> you are going to start running Nagios with your current
>> config files.
>>
>> HOST SCHEDULING INFORMATION
>> ---------------------------
>> Total hosts:                     25
>> Total scheduled hosts:           0
>> Host inter-check delay method:   SMART
>> Average host check interval:     0.00 sec
>> Host inter-check delay:          0.00 sec
>> Max host check spread:           30 min
>> First scheduled check:           N/A
>> Last scheduled check:            N/A
>>
>>
>> SERVICE SCHEDULING INFORMATION
>> -------------------------------
>> Total services:                     99
>> Total scheduled services:           99
>> Service inter-check delay method:   SMART
>> Average service check interval:     300.00 sec
>> Inter-check delay:                  3.03 sec
>> Interleave factor method:           SMART
>> Average services per host:          3.96
>> Service interleave factor:          4
>> Max service check spread:           30 min
>> First scheduled check:              Mon Dec 12 17:39:51 2005
>> Last scheduled check:               Mon Dec 12 17:44:47 2005
>>
>>
>> CHECK PROCESSING INFORMATION
>> ----------------------------
>> Service check reaper interval:      10 sec
>> Max concurrent service checks:      Unlimited
>>
>>
>> PERFORMANCE SUGGESTIONS
>> -----------------------
>> I have no suggestions - things look okay.
>>
>>
>> --------------------------------------------------------------------------------------------------------------------------------- 
>>
>>
>> grep -v ^# /etc/nagios/nagios.cfg  | grep -v ^$
>> Nagios.cfg params:
>>
>> log_file=/var/log/nagios/nagios.log
>> cfg_file=/etc/nagios/checkcommands.cfg
>> cfg_file=/etc/nagios/misccommands.cfg
>> cfg_file=/etc/nagios/contactgroups.cfg
>> cfg_file=/etc/nagios/contacts.cfg
>> cfg_file=/etc/nagios/dependencies.cfg
>> cfg_file=/etc/nagios/escalations.cfg
>> cfg_file=/etc/nagios/hostgroups.cfg
>> cfg_file=/etc/nagios/hosts.cfg
>> cfg_file=/etc/nagios/services.cfg
>> cfg_file=/etc/nagios/timeperiods.cfg
>> object_cache_file=/var/log/nagios/objects.cache
>> resource_file=/etc/nagios/resource.cfg
>> status_file=/var/log/nagios/status.dat
>> nagios_user=nagios
>> nagios_group=nagios
>> check_external_commands=1
>> command_check_interval=-1
>> command_file=/var/log/nagios/rw/nagios.cmd
>> comment_file=/var/log/nagios/comments.dat
>> downtime_file=/var/log/nagios/downtime.dat
>> lock_file=/var/run/nagios.pid
>> temp_file=/var/log/nagios/nagios.tmp
>> event_broker_options=-1
>> log_rotation_method=d
>> log_archive_path=/var/log/nagios/archives
>> use_syslog=1
>> log_notifications=1
>> log_service_retries=1
>> log_host_retries=1
>> log_event_handlers=1
>> log_initial_states=0
>> log_external_commands=1
>> log_passive_checks=1
>> service_inter_check_delay_method=s
>> max_service_check_spread=30
>> service_interleave_factor=s
>> host_inter_check_delay_method=s
>> max_host_check_spread=30
>> max_concurrent_checks=0
>> service_reaper_frequency=10
>> auto_reschedule_checks=0
>> auto_rescheduling_interval=30
>> auto_rescheduling_window=180
>> sleep_time=0.25
>> service_check_timeout=60
>> host_check_timeout=30
>> event_handler_timeout=30
>> notification_timeout=30
>> ocsp_timeout=5
>> perfdata_timeout=5
>> retain_state_information=1
>> state_retention_file=/var/log/nagios/retention.dat
>> retention_update_interval=60
>> use_retained_program_state=1
>> use_retained_scheduling_info=0
>> interval_length=60
>> use_aggressive_host_checking=0
>> execute_service_checks=1
>> accept_passive_service_checks=1
>> execute_host_checks=1
>> accept_passive_host_checks=1
>> enable_notifications=1
>> enable_event_handlers=1
>> process_performance_data=0
>> obsess_over_services=0
>> check_for_orphaned_services=0
>> check_service_freshness=1
>> service_freshness_check_interval=60
>> check_host_freshness=0
>> host_freshness_check_interval=60
>> aggregate_status_updates=1
>> status_update_interval=15
>> enable_flap_detection=0
>> low_service_flap_threshold=5.0
>> high_service_flap_threshold=20.0
>> low_host_flap_threshold=5.0
>> high_host_flap_threshold=20.0
>> date_format=us
>> p1_file=/usr/bin/p1.pl
>> illegal_object_name_chars=`~!$%^&*|'"<>?,()=
>> illegal_macro_output_chars=`~$&|'"<>
>> use_regexp_matching=0
>> use_true_regexp_matching=0
>> admin_email=nagios
>> admin_pager=pagenagios
>> daemon_dumps_core=0
>>
>> Any help is much appreciated.
>>
>> Thank you,
>>
>> Sheeri Kritzer
>>
>>
>> -------------------------------------------------------
>> This SF.net email is sponsored by: Splunk Inc. Do you grep through log 
>> files
>> for problems?  Stop!  Download the new AJAX search engine that makes
>> searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
>> http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when 
>> reporting any issue. ::: Messages without supporting info will risk 
>> being sent to /dev/null
>>
> 
> 
> 
> 


