Nagios ignores broken file descriptor?

Steven D. Morrey smorrey at ldschurch.org
Tue Nov 18 18:37:49 CET 2008


Here is an strace from the same box, taken just a few minutes ago.
As you can see, what's happening is that Nagios does not appear to be
catching the error about trying to write to a read-only file system.

nanosleep({1, 0},{1, 0})               = 0
kill(-7799, SIGKILL)                    = -1 ESRCH (No such process)
gettimeofday({1227029613, 509530}, NULL) = 0
close(10)                               = 0
open("/usr/local/nagios/var/nagios.log", O_RDWR|O_APPEND|O_CREAT, 0666) = -1 EROFS (Read-only file system)
gettimeofday({1227029613, 511146}, NULL) = 0
gettimeofday({1227029613, 511316}, NULL) = 0
time([1227029613])                      = 1227029613
time([1227029613])                      = 1227029613
gettimeofday({1227029613, 511843}, NULL) = 0
time([1227029613])                      = 1227029613
gettimeofday({1227029613, 512141}, NULL) = 0
gettimeofday({1227029613, 512287}, NULL) = 0
gettimeofday({1227029613, 511659}, NULL) = 0
time([1227029613])                      = 1227029613
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=877, ...}) = 0
gettimeofday({1227029613, 512585}, NULL) = 0
gettimeofday({1227029613, 512746}, NULL) = 0
gettimeofday({1227029613, 513905}, NULL) = 0
gettimeofday({1227029613, 514058}, NULL) = 0
time([1227029613])                      = 1227029613
time([1227029613])                      = 1227029613
time([1227029613])                      = 1227029613
gettimeofday({1227029613, 513917}, NULL) = 0
gettimeofday({1227029613, 514869}, NULL) = 0
time([1227029613])                      = 1227029613
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=877, ...}) = 0
time([1227029613])                      = 1227029613
time([1227029613])                      = 1227029613
gettimeofday({1227029613, 522324}, NULL) = 0
time([1227029613])                      = 1227029613
gettimeofday({1227029613, 521730}, NULL) = 0
pipe([10, 11])                          = 0
fcntl64(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
gettimeofday({1227029613, 522435}, NULL) = 0
gettimeofday({1227029613, 522584}, NULL) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40176708) = 7802
close(11)                               = 0
waitpid(7802, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 7802
--- SIGCHLD (Child exited) @ 0 (0) ---
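
To make the expectation concrete, here is a minimal Python sketch of what I
would expect a logger to do when that open() fails. It is not Nagios code
(Nagios is written in C), and the names open_log and describe_open_failure
are made up for this example. It only mirrors the open() flags from the
trace and inspects errno instead of carrying on silently.

```python
import errno
import os


def describe_open_failure(exc, path):
    """Turn an OSError from open() into a message an operator can act on."""
    if exc.errno == errno.EROFS:
        # The case seen in the trace: the file system was remounted read-only.
        reason = "read-only file system"
    else:
        reason = os.strerror(exc.errno)
    return "cannot open %s: %s" % (path, reason)


def open_log(path):
    """Open a log file with the same flags as the traced open() call.

    Returns (fd, None) on success or (None, message) on failure, so the
    caller is forced to look at the error instead of ignoring it.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_APPEND | os.O_CREAT, 0o666)
        return fd, None
    except OSError as exc:
        return None, describe_open_failure(exc, path)


if __name__ == "__main__":
    fd, err = open_log("/usr/local/nagios/var/nagios.log")
    if err is not None:
        # What to do next (alert, exit, restart) is a policy decision.
        print(err)
    else:
        os.close(fd)
```

Note that the failure surfaces as a return value from open() rather than as
a signal: SIGPIPE is only raised for writes to a pipe or socket with no
reader, so a process that ignores the return value sees nothing at all.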

As always thanks for looking into this with me.

Sincerely,
Steven D. Morrey
 
On Tue, 2008-11-18 at 10:28 -0700, Steven D. Morrey wrote:
> Hello Everyone,
> Over the weekend my test implementation of Nagios stopped recording
> results.
> I checked with ps and it is still running along just fine but it appears
> to have lost the ability to write out results.
> After doing some checking I noticed that it stopped writing at 5pm last
> Saturday.
> On a hunch I checked my /var/messages and found this little beauty of an
> error.
> 
> Nov 15 05:01:43 test-system kernel: SCSI error : <0 0 0 0> return code = 0x20008
> Nov 15 05:01:43 test-system kernel: end_request: I/O error, dev sda, sector 27780153
> Nov 15 05:01:43 test-system kernel: buffer layer error at fs/buffer.c:2996
> Nov 15 05:01:43 test-system kernel: Call Trace:
> Nov 15 05:01:43 test-system kernel:  [<c0160649>] drop_buffers+0x149/0x1c0
> Nov 15 05:01:43 test-system kernel:  [<c01606e4>] try_to_free_buffers+0x24/0x70
> Nov 15 05:01:43 test-system kernel:  [<f8cbe5cc>] reiserfs_releasepage+0x5c/0xa0 [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<f8cbe570>] reiserfs_releasepage+0x0/0xa0 [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<c0160765>] try_to_release_page+0x35/0x50
> Nov 15 05:01:43 test-system kernel:  [<f8cbe76c>] reiserfs_invalidatepage+0x15c/0x1b0 [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<c01494c4>] do_invalidatepage+0x14/0x30
> Nov 15 05:01:43 test-system kernel:  [<c01499ce>] truncate_complete_page+0x9e/0xc0
> Nov 15 05:01:43 test-system kernel:  [<c0149a93>] truncate_inode_pages+0xa3/0x300
> Nov 15 05:01:43 test-system kernel:  [<f8cc2b70>] reiserfs_delete_inode+0x0/0xdc [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<f8cc2b88>] reiserfs_delete_inode+0x18/0xdc [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<c017608d>] __d_move+0xed/0x1f0
> Nov 15 05:01:43 test-system kernel:  [<c016bcb4>] vfs_rename_other+0x74/0x110
> Nov 15 05:01:43 test-system kernel:  [<f8cc2b70>] reiserfs_delete_inode+0x0/0xdc [reiserfs]
> Nov 15 05:01:43 test-system kernel:  [<c0177c14>] generic_delete_inode+0x94/0x120
> Nov 15 05:01:43 test-system kernel:  [<c0176de7>] iput+0x57/0x90
> Nov 15 05:01:43 test-system kernel:  [<c0175537>] dput+0x17/0x180
> Nov 15 05:01:43 test-system kernel:  [<c016eccb>] sys_rename+0x24b/0x2c0
> Nov 15 05:01:43 test-system kernel:  [<c0107db9>] sysenter_past_esp+0x52/0x79
> Nov 15 05:01:43 test-system kernel: 
> Nov 15 05:01:54 test-system kernel: REISERFS: abort (device dm-1): Journal write error in flush_commit_list
> Nov 15 05:01:54 test-system kernel: REISERFS: Aborting journal for filesystem on dm-1
> 
> Now I think the root cause of the file system error was that ntpd was
> running and set the system time to somewhere in the past, thereby
> confusing the filesystem.  But that is neither here nor there.
> 
> I would normally expect the program to either receive a SIGPIPE or at a
> minimum have the write operation return an error of some sort and either
> shut the system down or restart nagios.  But in this case nothing is
> happening.  Is this normal behavior for Nagios, or am I missing
> something?
> 
> For the record, we are running a modified version of Nagios 2.7 with
> dnx 0.19 on SLES 9 patch level 4, so if this is a known bug that was
> fixed in a later version of Nagios, I would really appreciate knowing
> about that as well.
> 
> Thanks in advance!
> 
> Sincerely,
> Steven Morrey
> 
> 
> 
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel


