status.dat not created

Marc Powell marc at ena.com
Wed Dec 3 19:57:08 CET 2003


Thanks for the input, Ethan. I've made progress, but I'm still not seeing
status.dat created. See inline comments --

> -----Original Message-----
> From: Ethan Galstad [mailto:nagios at nagios.org]
> Sent: Tuesday, December 02, 2003 9:10 PM
> To: nagios-devel at lists.sourceforge.net
> Subject: Re: [Nagios-devel] status.dat not created
> 
> The status file is only created/updated:
> 
> 	1. At regular intervals when aggregated updates are enabled
> 
> 	and/or
> 
> 	2. When the status of a host or service changes (i.e. a check is
> performed, a notification occurs, etc.)
> 
> Check the log file to make sure the passive checks are being received
> and processed.  I'm guessing that's where the problem lies - if the
> passive checks aren't being processed, there's no reason for Nagios
> to create the status file.  Actually, that might not be totally
> correct.  Nagios should create a status file immediately upon
> startup.  Check your config file to make sure you don't have more
> than one status_file definition, etc.

Actually, I was having a problem with the passive checks. I didn't
realize that iptables was configured on this machine. Once that was
corrected (read 'disabled'), I could see nagios accept the passive
checks (via strace), but nothing ever gets logged, nor is the status.dat
file ever created. As you mentioned, I would expect nagios to create the
status_file on startup regardless (even if it is with faked info). First,
some more relevant config information --

[root at daginbox etc]# grep status_file nagios.cfg
status_file=/usr/local/nagios/var/status.dat

[root at daginbox etc]# grep passive nagios.cfg | grep -v "^#"
log_passive_checks=1
accept_passive_service_checks=1
accept_passive_host_checks=1

and these --
aggregate_status_updates=1
status_update_interval=10
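
Ethan's suggestion to rule out a duplicate status_file definition can
also be checked mechanically. A minimal sketch, assuming the main config
file consists of simple name=value lines with # comments (as the grep
output above suggests); the function name and the sample text are
hypothetical and not part of Nagios --

```python
from collections import Counter


def duplicate_directives(config_text):
    """Return {directive: count} for directives defined more than once."""
    counts = Counter()
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, _value = line.partition("=")
        if sep:
            counts[name.strip()] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample; in practice, read the text from nagios.cfg.
sample = """\
# STATUS FILE
status_file=/usr/local/nagios/var/status.dat
aggregate_status_updates=1
status_update_interval=10
"""
print(duplicate_directives(sample))
```

Directives such as cfg_file may legitimately appear more than once, so
any hit still needs a human look.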

I now have all the logging options enabled and have compiled with
--enable-DEBUG0 and --enable-DEBUG1. nagios.log shows only --

[1070473990] Nagios 2.0a1 starting... (PID=7131)
[1070473990] LOG VERSION: 2.0
[1070474026] Nagios 2.0a1 starting... (PID=7169)
[1070474026] LOG VERSION: 2.0
[1070474089] Nagios 2.0a1 starting... (PID=7263)
[1070474089] LOG VERSION: 2.0
[1070474138] Nagios 2.0a1 starting... (PID=7305)
[1070474138] LOG VERSION: 2.0
[1070474805] Nagios 2.0a1 starting... (PID=7565)
[1070474805] LOG VERSION: 2.0
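
The bracketed numbers in the log are Unix epoch timestamps. A small
sketch for turning them into readable UTC times, which helps when lining
the log up against the strace output; the helper name is hypothetical --

```python
import re
from datetime import datetime, timezone


def readable(log_line):
    """Replace a leading [epoch] stamp with an ISO 8601 UTC time."""
    match = re.match(r"\[(\d+)\]\s*(.*)", log_line)
    if not match:
        return log_line  # leave lines without a stamp untouched
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return f"{stamp.isoformat()} {match.group(2)}"


print(readable("[1070473990] Nagios 2.0a1 starting... (PID=7131)"))
```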



Strace output of nagios reading passive checks from the command file --

[snip]
[pid  7169] open("/usr/local/nagios/var/retention.dat", O_RDONLY) = 4
[pid  7169] fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid  7169] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbeb47000
[pid  7169] read(4, "", 4096)           = 0
[pid  7169] close(4)                    = 0
[pid  7169] munmap(0xbeb47000, 4096)    = 0
[pid  7169] time([1070474033])          = 1070474033
[pid  7170] <... select resumed> )      = 0 (Timeout)
[pid  7170] read(3, "[1070474024] PROCESS_SERVICE_CHECK_RESULT;tnops-rhea-hs.rhea.tn.ena.net;WCCP;0;WCCP OK: Total Packets Redirected:            30729337\n[1070474024] PROCESS_SERVICE_CHECK_RESULT;tnops-scott-ecr-cache.scott.tn.ena.net;CACHING;0;<A href=\"http://monitor.corp.ena.net/cgi/tools.cgi?ip=208.183.104.58&host=scott-ecr-cache.scott.tn.ena.net&service=CACHING\">HTTP ok: HTTP/1.0 200 OK - 1 second response time </A>\n[1070474024] PROCESS_SERVICE_CHECK_RESULT;tnops-rhea-alt.rhea.tn.ena.net;MULTI-EGRESS;0;<A href=\"http://moni"..., 4096) = 3887
[pid  7170] read(3, "", 4096)           = 0
[pid  7170] select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
[pid  7170] read(3, "[1070474034] PROCESS_SERVICE_CHECK_RESULT;tnops-pigeonforge-hs.sevier.tn.ena.net;MULTI-EGRESS;0;<A href=\"http://monitor.corp.ena.net/cgi/tools.cgi?ip=172.31.115.13&host=pigeonforge-hs.sevier.tn.ena.net&service=MULTI_EGRESS\">EGRESS CHECK OK - All circuits up</A>\n", 4096) = 262
[pid  7170] read(3, "", 4096)           = 0
[pid  7170] select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
[pid  7170] read(3, "", 4096)           = 0
[pid  7170] select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
[pid  7170] read(3, "", 4096)           = 0
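
The strings read from the command file follow the external command
format "[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return
code;plugin output". A minimal sketch of splitting such a line, handy
for confirming that the host and service names in a submitted result
match what the object config defines. The function and type names are
hypothetical, and this is not how nagios itself parses commands --

```python
import re
from typing import NamedTuple, Optional


class PassiveResult(NamedTuple):
    timestamp: int
    host: str
    service: str
    return_code: int
    output: str


LINE = re.compile(r"^\[(\d+)\] PROCESS_SERVICE_CHECK_RESULT;(.*)$")


def parse_passive_result(line: str) -> Optional[PassiveResult]:
    """Parse one passive service check line, or return None if malformed."""
    match = LINE.match(line.rstrip("\n"))
    if not match:
        return None
    # Split at most three times so semicolons in the plugin output survive.
    fields = match.group(2).split(";", 3)
    if len(fields) != 4:
        return None
    host, service, code, output = fields
    if not code.isdigit():
        return None
    return PassiveResult(int(match.group(1)), host, service, int(code), output)


# First result from the strace output above.
line = ("[1070474024] PROCESS_SERVICE_CHECK_RESULT;"
        "tnops-rhea-hs.rhea.tn.ena.net;WCCP;0;"
        "WCCP OK: Total Packets Redirected:            30729337\n")
result = parse_passive_result(line)
print(result.host, result.service, result.return_code)
```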

Earlier in the trace, before the reads shown above, I see this --

[pid  7169] munmap(0xbeb47000, 4096)    = 0
[pid  7169] unlink("/usr/local/nagios/var/status.dat") = -1 ENOENT (No such file or directory)
[pid  7169] open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 4
[pid  7169] fstat64(4, {st_mode=S_IFREG|0644, st_size=27627, ...}) = 0
[pid  7169] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbeb47000
[pid  7169] read(4,
"#######################################################################
#######\n#\n# NAGIOS.CFG - Sample Main Config File for Nagios \n#\n#
Read the documentation for more information on this configuration\n#
file.  I\'ve provided some comments here, but things may not be so\n#
clear without further explanation.\n#\n# Last Modified:
11-08-2003\n#\n#########################################################
#####################\n\n\n# LOG FILE\n# This is the main log file where
service and host events are logged\n# for histor"..., 4096) = 4096
[pid  7169] read(4, "nagios_user=nagios\n\n\n\n# NAGIOS GROUP\n# This
determines the effective group that Nagios should run as.  \n# You can
either supply a group name or a GID.\n\nnagios_group=nagios\n\n\n\n#
EXTERNAL COMMAND OPTION\n# This option allows you to specify whether or
not Nagios should check\n# for external commands (in the command file
defined below).  By default\n# Nagios will *not* check for external
commands, just to be on the\n# cautious side.  If you want to be able to
use the CGI command interface\n# you will have to enable "..., 4096) =
4096
[pid  7169] read(4, ".\n\nuse_syslog=0\n\n\n\n# NOTIFICATION LOGGING
OPTION\n# If you don\'t want notifications to be logged, set this value
to 0.\n# If notifications should be logged, set the value to
1.\n\nlog_notifications=1\n\n\n\n# SERVICE RETRY LOGGING OPTION\n# If
you don\'t want service check retries to be logged, set this value\n# to
0.  If retries should be logged, set the value to
1.\n\nlog_service_retries=1\n\n\n\n# HOST RETRY LOGGING OPTION\n# If you
don\'t want host check retries to be logged, set this value to\n# 0.  If
retries should be l"..., 4096) = 4096
[pid  7169] read(4, "hecks when it starts monitoring.  The\n# default is
to use smart delay calculation, which will try to\n# space all host
checks out evenly to minimize CPU load.\n# Using the dumb setting will
cause all checks to be scheduled\n# at the same time (with no delay
between them)!\n#\tn\t= None - don\'t use any delay between
checks\n#\td\t= Use a \"dumb\" delay of 1 second between checks\n#\ts\t=
Use \"smart\" inter-check delay calculation\n#       x.xx    = Use an
inter-check delay of x.xx
seconds\n\nhost_inter_check_delay_method=s\n\n\n"..., 4096) = 4096
[pid  7169] read(4, "ach interval is one minute long (60 seconds).
Other settings\n# have not been tested much, so your mileage is likely
to vary...\n\ninterval_length=60\n\n\n\n# AGRESSIVE HOST CHECKING
OPTION\n# If you don\'t want to turn on agressive host checking
features, set\n# this value to 0 (the default).  Otherwise set this
value to 1 to\n# enable the agressive check option.  Read the docs for
more info\n# on what agressive host check is or check out the source
code in\n# base/checks.c\n\nuse_agressive_host_checking=0\n\n\n\n#
SERVICE "..., 4096) = 4096
[pid  7169] read(4, "ervice check that is\n# processed by Nagios.  This
command is executed only if the\n# obsess_over_service option (above) is
set to 1.  The command \n# argument is the short name of a command
definition that you\n# define in your host configuration file. Read the
HTML docs for\n# more information on implementing distributed
monitoring.\n\n#ocsp_command=somecommand\n\n\n\n# ORPHANED SERVICE CHECK
OPTION\n# This determines whether or not Nagios will periodically \n#
check for orphaned services.  Since service checks are no"..., 4096) =
4096
[pid  7170] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  7169] read(4,  <unfinished ...>
[pid  7170] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  7170] select(0, NULL, NULL, NULL, {0, 500000} <unfinished ...>
[pid  7169] <... read resumed>
"01\t(YYYY-MM-DDTHH:MM:SS)\n#\n\ndate_format=us\n\n\n\n# MAXIMUM
EMBEDDED PERL INTERPRETER CALLS\n# This value determines how often (if
at all) the embedded Perl\n# interpreter is reinitialized during
runtime.  This is useful\n# if you notice that the Perl interpreter is
causing slow \n# memory leaks over time.  Setting this value to 0 means
the \n# embedded Perl interpreter will never be reinitialized.  Any\n#
value > 0 is the number of times the embedded Perl interpreter\n# is
used (i.e. a Perl plugin is executed) before"..., 4096) = 3051
[pid  7169] read(4, "", 4096)           = 0
[pid  7169] close(4)                    = 0
[pid  7169] munmap(0xbeb47000, 4096)    = 0
[pid  7169] stat64("/usr/local/nagios/var/comments.dat", {st_mode=S_IFREG|0664, st_size=240, ...}) = 0

That's the only occurrence of status.dat (the unlink) that I ever see.
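
A quick way to confirm this across a full trace is to filter for every
syscall that names the file. A sketch, assuming the trace is available
with one syscall per line (i.e. not hard-wrapped by a mail client); the
function name is hypothetical --

```python
def syscalls_touching(trace_lines, path):
    """Yield (syscall_name, line) for trace lines whose arguments name path."""
    needle = f'"{path}"'
    for line in trace_lines:
        if needle not in line:
            continue
        # Drop an optional "[pid N] " prefix, then take the name before "(".
        body = line.split("] ", 1)[1] if line.startswith("[pid") else line
        yield body.split("(", 1)[0].strip(), line.rstrip("\n")


trace = [
    '[pid  7169] munmap(0xbeb47000, 4096)    = 0',
    '[pid  7169] unlink("/usr/local/nagios/var/status.dat") = -1 ENOENT (No such file or directory)',
    '[pid  7169] open("/usr/local/nagios/etc/nagios.cfg", O_RDONLY) = 4',
]
for name, line in syscalls_touching(trace, "/usr/local/nagios/var/status.dat"):
    print(name, "->", line)
```

If only unlink ever shows up, nagios never attempted to open or create
the file during the traced period.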

> 
> I would also suggest that you enable aggregated status updates - not
> doing so is a huge waste of CPU/disk time when you have a lot of
> monitoring activity.

I agree and normally use it. I had disabled it for testing purposes
only, hoping that I would see different behavior.

So I'm still basically where I was. I now know that nagios is indeed
seeing the passive service checks, but it doesn't appear to be doing
anything with them. Nor is it logging anything (either to nagios.log or
to the console, now that I have debug enabled). The host/service config
file that I am using was taken directly from the monitoring host, and
only the template has been modified to enable notifications. Some
additional verifications --

	- The service checks are being submitted to 3 boxes
	  simultaneously via submit_check_result: two nagios-1.1 machines
	  and the nagios-2.0 machine. The two 1.1 boxes receive and
	  process the results appropriately. The only change in the
	  command line is the IP of the host to send to.
	- I'm seeing the same problem on a Red Hat 7.3 box (one of the
	  above machines, concurrently running 1.1) and a Fedora Core 1
	  machine.

Any other thoughts or troubleshooting recommendations?

Thanks!

Marc

