Apparently incomplete contents of log files

Skip Montanaro skip at pobox.com
Tue Dec 23 22:20:11 CET 2003


I'm attempting to do my own log file analysis for some quarterly reports our
management wants, and I'm confused by the mismatch between what I see in the
archived logs and what Nagios reports on-screen.  Consider one host here,
almamater.  Looking at the lines in the log file related to it on September
25th I see:

    [1064438580] Warning: Return code of 139 for check of service 'PING' on host 'almamater.itcs.northwestern.edu' was out of bounds.
    [1064438580] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;1;(No output!)
    [1064438580] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;2;(No output!)
    [1064438580] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;3;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;4;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;5;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;6;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;7;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;8;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;SOFT;9;(No output!)
    [1064438581] HOST ALERT: almamater.itcs.northwestern.edu;DOWN;HARD;10;(No output!)
    [1064438581] HOST NOTIFICATION: donh;almamater.itcs.northwestern.edu;DOWN;host-notify-by-email;(No output!)
    [1064438581] SERVICE ALERT: almamater.itcs.northwestern.edu;PING;CRITICAL;HARD;1;(Return code of 139 is out of bounds)
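
For anyone comparing these entries with the on-screen display, the bracketed
numbers are Unix epoch seconds.  Here is a minimal Python sketch for turning a
log line into a readable local time.  The UTC-5 offset (US Central Daylight
Time) is an assumption, since the post does not state the server's timezone,
and `parse_line` is a hypothetical helper, not part of Nagios.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the Nagios server keeps US Central Daylight Time (UTC-5).
ASSUMED_TZ = timezone(timedelta(hours=-5))

# Matches "[epoch] rest of the message".
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def parse_line(line):
    """Split a '[epoch] message' log line into (aware datetime, message).

    Returns None if the line does not start with a bracketed timestamp.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=ASSUMED_TZ)
    return stamp, match.group(2)


sample = ("[1064438581] HOST ALERT: "
          "almamater.itcs.northwestern.edu;DOWN;HARD;10;(No output!)")
stamp, message = parse_line(sample)
print(stamp.strftime("%Y-%m-%dT%H:%M:%S"), message)
```

Under that timezone assumption, the DOWN/HARD entry works out to
2003-09-24T16:23:01.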

That's all well and good, but note the last HOST ALERT line.  At that point,
the host is DOWN/HARD and the system engineer is notified.  Looking at more
recent logs I never see a state change back to UP (or any other state), but
the machine is clearly up and nagios reports it as up.  Here's the info
Nagios displays on-screen about that particular alert:

    2003-09-24T16:23:01 2003-09-24T16:25:56 0d 0h 2m 55s HOST DOWN (No output!)

There is *nothing* in that log file or any other more recent log file to
show that the machine came back up.  Interestingly enough, Nagios itself was
shut down at 16:25:56, then restarted a bit later:

    ...
    [1064438756] Caught SIGTERM, shutting down...
    [1064438756] Successfully shutdown... (PID=22201)
    [1064438756] Successfully shutdown... (PID=22201)
    [1064439292] Nagios 1.1 starting... (PID=14981)
    [1064439292] Warning: Host 'csd.soc.northwestern.edu' is not a member of any host groups!
    ...

Is Nagios failing to write a record to the log file after the first check
that indicates the machine/service is up?  If so, how does it calculate the
2m 55s duration of the outage?
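
The 2m 55s figure can be checked against the timestamps quoted above.  The
sketch below takes the epoch values from the DOWN/HARD alert, the SIGTERM
line, and the "starting..." line, and formats the gaps the way the on-screen
display does.  `format_duration` is a hypothetical helper, and this only shows
the arithmetic; it does not confirm how Nagios itself derives the duration.

```python
def format_duration(seconds):
    """Format a number of seconds as 'Nd Nh Nm Ns'."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Epoch timestamps copied from the log excerpts in this message.
hard_down = 1064438581  # HOST ALERT ... DOWN;HARD;10
sigterm = 1064438756    # Caught SIGTERM, shutting down...
restart = 1064439292    # Nagios 1.1 starting...

print("DOWN/HARD to shutdown:", format_duration(sigterm - hard_down))
print("shutdown to restart:  ", format_duration(restart - sigterm))
```

The first gap is exactly 0d 0h 2m 55s, so the on-screen outage appears to end
at the moment Nagios was shut down rather than at any logged recovery.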

Thx,

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://spambayes.sf.net/
skip at pobox.com

