status.cgi: malloc(): memory corruption (fast)

Bo Philip Larsen boph at tdchosting.dk
Mon Dec 22 09:29:13 CET 2008


Hi,

I do not know whether this is a developer issue, but I will try anyway.

I have looked into the error below. Is it a bug in the CGIs, or ...?

We have a distributed Nagios setup with one master server and 8 slave servers, monitoring 2000+ hosts and 12000+ services. All Nagios servers run RedHat EL5; the master server now runs Nagios 3.0.6 and the distributed servers run 3.0.3. We use nrpe 2.10 for client checks and nsca 2.7.2 for communication between the Nagios servers, with Simple XOR encryption.

The problem was discovered on the master server when the CGIs went blank in the browser, while the Nagios process was still running and processing. All distributed servers were working. The Nagios log shows no errors, but the Apache error_log shows:

[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] *** glibc detected *** /usr/local/nagios/sbin/status.cgi: malloc(): memory corruption (fast): 0x084104c0 ***
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] ======= Backtrace: =========
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /lib/libc.so.6[0x94e91e]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /lib/libc.so.6(__libc_malloc+0x7e)[0x94f35e]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /usr/local/nagios/sbin/status.cgi[0x8056b16]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /usr/local/nagios/sbin/status.cgi[0x80719fc]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /usr/local/nagios/sbin/status.cgi[0x8057c5e]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /usr/local/nagios/sbin/status.cgi[0x8054047]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /lib/libc.so.6(__libc_start_main+0xdc)[0x8fadec]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] /usr/local/nagios/sbin/status.cgi[0x8048ec1]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] ======= Memory map: ========
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00110000-0011b000 r-xp 00000000 68:03 7987975    /lib/libgcc_s-4.1.2-20080102.so.1
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 0011b000-0011c000 rwxp 0000a000 68:03 7987975    /lib/libgcc_s-4.1.2-20080102.so.1
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 008c7000-008e1000 r-xp 00000000 68:03 7987623    /lib/ld-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 008e1000-008e2000 r-xp 00019000 68:03 7987623    /lib/ld-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 008e2000-008e3000 rwxp 0001a000 68:03 7987623    /lib/ld-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 008e5000-00a22000 r-xp 00000000 68:03 7987965    /lib/libc-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a22000-00a24000 r-xp 0013d000 68:03 7987965    /lib/libc-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a24000-00a25000 rwxp 0013f000 68:03 7987965    /lib/libc-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a25000-00a28000 rwxp 00a25000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a30000-00a43000 r-xp 00000000 68:03 7987973    /lib/libpthread-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a43000-00a44000 r-xp 00012000 68:03 7987973    /lib/libpthread-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a44000-00a45000 rwxp 00013000 68:03 7987973    /lib/libpthread-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00a45000-00a47000 rwxp 00a45000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00ae9000-00af0000 r-xp 00000000 68:03 7987974    /lib/librt-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00af0000-00af1000 r-xp 00006000 68:03 7987974    /lib/librt-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00af1000-00af2000 rwxp 00007000 68:03 7987974    /lib/librt-2.5.so
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00af4000-00b91000 r-xp 00000000 68:03 7987981    /lib/libglib-2.0.so.0.1200.3
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00b91000-00b92000 rwxp 0009c000 68:03 7987981    /lib/libglib-2.0.so.0.1200.3
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 00ec2000-00ec3000 r-xp 00ec2000 00:00 0          [vdso]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 08048000-08080000 r-xp 00000000 68:03 8544239    /usr/local/nagios/sbin/status.cgi
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 08080000-08081000 rw-p 00038000 68:03 8544239    /usr/local/nagios/sbin/status.cgi
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 08081000-08084000 rw-p 08081000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] 08173000-0a0a4000 rw-p 08173000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] b6900000-b6921000 rw-p b6900000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] b6921000-b6a00000 ---p b6921000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] b6a3c000-b7f07000 r--p 00000000 68:03 8544989    /usr/local/nagios/var/status.dat
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] b7f07000-b7f09000 rw-p b7f07000 00:00 0
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] bfc67000-bfc7c000 rw-p bfc67000 00:00 0          [stack]
[Fri Dec 19 04:08:39 2008] [error] [client x.x.x.x] Premature end of script headers: status.cgi

After some googling we found http://archive.netbsd.se/?ml=nagiosplug-devel&a=2008-04&m=7148084

I tried stopping the Nagios process on the master server, deleting the retention.dat file and starting Nagios again. This resulted in a working CGI interface for a minute or so. A look into the retention.dat file on the distributed Nagios server shows this for a single service check (the check plugin is used by 500+ service checks):

service {
host_name=ims-v5
service_description=check_log_tdc_rman
modified_attributes=0
check_command=check_nrpe!9991!check_log_tdc_rman
check_period=24x7
notification_period=24x7
event_handler=
has_been_checked=1
check_execution_time=0.034
check_latency=0.370
check_type=0
current_state=3
last_state=3
last_hard_state=3
last_event_id=5613
current_event_id=5647
current_problem_id=2775
last_problem_id=2567
current_attempt=4
max_attempts=4
current_event_id=5647
last_event_id=5613
normal_check_interval=5.000000
retry_check_interval=1.000000
state_type=1
last_state_change=1229684592
last_hard_state_change=1229684592
last_time_ok=1229683949
last_time_warning=0
last_time_unknown=1229714047
last_time_critical=1229684532
plugin_output=FATAL: File '/tmp/chk_RMAN_alertlog.log' not found or not readable.
long_plugin_output=ÿ>ø<80>ÿ4^Eh\n
performance_data=
last_check=1229714047
next_check=1229714347
check_options=0
notified_on_unknown=0
notified_on_warning=0
notified_on_critical=0
current_notification_number=0
current_notification_id=0
last_notification=0
notifications_enabled=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_service=1
is_flapping=0
percent_state_change=0.00
check_flapping_recovery_notification=0
state_history=3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
_NOC_STRING=0;INGEN_VAGTGRUPPE
}
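
A minimal sketch of how entries like the one above could be located in a retention.dat file. It assumes the block layout shown here ("service {" ... "}" with key=value lines); the function names are hypothetical and not part of Nagios. It treats any byte outside printable ASCII as suspicious, which is only a heuristic: legitimate non-ASCII plugin output would be flagged as well.

```python
# Fields to inspect; the key names are taken from the retention.dat entry above.
OUTPUT_KEYS = (b"plugin_output", b"long_plugin_output")


def is_clean(value: bytes) -> bool:
    """Return True if every byte is printable ASCII or a tab."""
    return all(b == 0x09 or 0x20 <= b <= 0x7E for b in value)


def find_suspect_entries(data: bytes):
    """Yield (host_name, service_description, key, value) for each output
    field inside a service block that contains non-printable bytes."""
    host = desc = None
    in_service = False
    for raw in data.split(b"\n"):
        line = raw.rstrip(b"\r").lstrip(b" \t")
        if line.startswith(b"service {"):
            in_service, host, desc = True, None, None
        elif line == b"}":
            in_service = False
        elif in_service and b"=" in line:
            key, _, value = line.partition(b"=")
            if key == b"host_name":
                host = value
            elif key == b"service_description":
                desc = value
            elif key in OUTPUT_KEYS and not is_clean(value):
                yield host, desc, key, value


def scan_file(path: str) -> None:
    """Print every suspicious output field found in the file at `path`."""
    # Read as bytes so that invalid characters are preserved, not decoded away.
    with open(path, "rb") as fh:
        for host, desc, key, value in find_suspect_entries(fh.read()):
            print(host, desc, key, repr(value))
```

Calling scan_file() with the path of a retention.dat file (the location depends on the installation) prints one line per suspicious field, with the raw bytes shown via repr() so they cannot corrupt the terminal.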

A look at the service extinfo CGI on the distributed Nagios server shows this:

Service check_log_tdc_rman On Host ims-v5 (ims-v5)
Member of all_services, log_files
x.x.x.x
Service State Information Current Status:      UNKNOWN   (for 0d 8h 34m 33s)
Status Information:    FATAL: File '/tmp/chk_RMAN_alertlog.log' not found or not readable. ?>??4h
Performance Data:
Current Attempt:    4/4  (HARD state)
Last Check Time:    2008-12-19 20:34:07
Check Type:    ACTIVE
Check Latency / Duration:    0.270 / 0.044 seconds


Running the service check by hand returns this:

[root at nagsrv003 nagios]# ./bin/check_nrpe -H ims-v5 -p 9991 -n -c check_log_tdc_rman
FATAL: File '/tmp/chk_RMAN_alertlog.log' not found or not readable.
#>#####h

When I removed this single service check, everything worked again.
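
As an alternative to removing the check, the plugin output could be cleaned before it is passed on. The following is only a sketch under assumptions: sanitize_output is a hypothetical helper, not part of Nagios, nrpe or nsca, and it has not been verified that cleaned output avoids the CGI crash described above.

```python
def sanitize_output(raw: bytes, replacement: bytes = b"?") -> bytes:
    """Replace every byte that is not printable ASCII, a tab or a newline
    with the first byte of `replacement`."""
    keep = {0x09, 0x0A}  # tab and newline are left untouched
    return bytes(
        b if (b in keep or 0x20 <= b <= 0x7E) else replacement[0]
        for b in raw
    )
```

For example, sanitize_output(b"not readable.\n\xff>\x05h\n") returns b"not readable.\n?>?h\n". A wrapper script around the check command could apply this to the plugin's stdout before printing it.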

Yes, maybe I have a bug in the plugin script, but why do the CGIs on the master server fail with a memory error while all the distributed servers work?


Regards

Bo Larsen


