NDO utils bug/explanation

Frassinelli, Marco marco.frassinelli at eds.com
Fri Sep 18 14:57:12 CEST 2009


Hello

The process of starting ndo2db and then Nagios makes sure that there is
actual data within the DB. If there is an outdated data within the DB it
needs to be removed before Nagios even sends new data. So the process of
trimming those table entries is truly intentional at the beginning
(so-called pre-launch state where the if condition matches). If ndo2db
fails for some reason, those data will remain within the database and
then removed during the next start. 

                Ok, when the parent ndo2db process starts. But not when
the child starts.


Depending on your startup routine I would guess that you started Nagios
first and then ndo2db. But it shouldn't because ndomod as an event
broker keeps data not written to ndo2db in a defined cache. Depending on
your configuration this cache may be to little so the oldest entry could
be lost (in this case the programstatus of Nagios). But that's really a
guess you'll have to give more information where and when this error
occurs mentioning all circumstances you'll catch up in the logs (turn on
very detailed and everything in debug_level in case). 

                Ndo2db starts first. In the case ndomod can't write to
ndo2db it logs the condition, and this is not the case. Every log are at
maximum detail.

 

Which code and which configuration?
The only thing I can see here is tstamp.tv_sec which is a converted
timestamp got from eventbroker module. This is kind of now() but
recently a now() from Nagios itsself. You may check
ndo2db.c;ndo2db_convert_standard_data_elements
The other compared value is dbinfo.latest_realtime_data_time which is
initialized in db.c:ndo2db_db_init and then updated if
dbinfo.latest_program_status_time newer (db.c:374; directly to that
value). There are several other realtime datavalues which may update
this value.
So the clue of this data is - if actual Nagios NDO_DATA_TIMESTAMP is
newer than the latest realtime data gotten some time before, it is time
for a cleanup at the very beginning of ndo2db (check the sequence in
ido2db.c:main). 

                The code and configuration of visualization software as
nagvis.


The conditional statement does not only insist on the difference 0 or
more but also if it is a process pre launch (see above). But besides a
question - how did you get to this values? Current NDOUtils code doesn't
give and debug information at this stage. 

                I am referring to the code in visualization software,
not ndo. 

  
Seeing your ndo2db die and refork explains why the pre_launch_state and
timestamp condition is matching and so within each period of time,
database cleanup is performed.
It would be interesting why ndo2db is dying. Depending on your
configuration this may vary - tcp or unix socket e.g.? What about more
detailed debuglogs or are there messages like "error writing to
datasink" in the logs?

                The clue here is that ndo2db doesn't  seem to have died
instead it exit gracefully, as you can see from the strace output below.

                The other strange thing is that ndo2db refork exactly
every 60 seconds. Where in the code I can find this interval? Why the
child ndo2db exits?

 [root at itsss214 var]# ps -ef |grep ndo2

nagios   24831     1  0 09:53 ?        00:00:00 /opt/nagios/bin/ndo2db
-c /opt/nagios/etc/ndo2db.cfg

nagios   24957 24831  1 14:51 ?        00:00:00 /opt/nagios/bin/ndo2db
-c /opt/nagios/etc/ndo2db.cfg

                Note the child startup time, 14:51, parent started at
9:53

 

time(NULL)                              = 1253276162

gettimeofday({1253276162, 271627}, NULL) = 0

write(3, "[1253276162.271627] [002.0] [pid"..., 272) = 272

lseek(3, 0, SEEK_CUR)                   = 53846244

poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0

write(7, "\350\0\0\0\3INSERT INTO nagios_processe"..., 236) = 236

read(7, "\t\0\0\1\0\1\374\371\341\2\0\0\0", 16384) = 13

time([1253276162])                      = 1253276162

time(NULL)                              = 1253276162

gettimeofday({1253276162, 272624}, NULL) = 0

write(3, "[1253276162.272624] [002.0] [pid"..., 272) = 272

lseek(3, 0, SEEK_CUR)                   = 53846516

poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0

write(7, "\350\0\0\0\3INSERT INTO nagios_processe"..., 236) = 236

read(7, "\t\0\0\1\0\1\374\372\341\2\0\0\0", 16384) = 13

gettimeofday({1253276162, 272626}, NULL) = 0

write(3, "[1253276162.272626] [002.0] [pid"..., 163) = 163

lseek(3, 0, SEEK_CUR)                   = 53846679

poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0

write(7, "{\0\0\0\3UPDATE nagios_programstatus"..., 127) = 127

read(7, "0\0\0\1\0\1\0\2\0\0\0(Rows matched: 1  Cha"..., 16384) = 52

time([1253276162])                      = 1253276162

time(NULL)                              = 1253276162

time([1253276162])                      = 1253276162

read(6, "\n219:\n1=1603\n2=0\n3=0\n4=125327616"..., 511) = 46

time(NULL)                              = 1253276162

time([1253276162])                      = 1253276162

read(6, "\n1000\nENDTIME: 1253276162\nGOODBY"..., 511) = 35

read(6, "", 511)                        = 0

gettimeofday({1253276162, 312611}, NULL) = 0

write(3, "[1253276162.312611] [002.0] [pid"..., 258) = 258

lseek(3, 0, SEEK_CUR)                   = 53846937

poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0

write(7, "\332\0\0\0\3UPDATE nagios_conninfo SET "..., 222) = 222

read(7, "0\0\0\1\0\1\0\2\0\0\0(Rows matched: 1  Cha"..., 16384) = 52

write(7, "\1\0\0\0\1", 5)               = 5

shutdown(7, 2 /* send and receive */)   = 0

close(7)                                = 0

time([1253276162])                      = 1253276162

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=951, ...}) = 0

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=951, ...}) = 0

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=951, ...}) = 0

sendto(8, "<14>Sep 18 14:16:02 ndo2db: Succ"..., 73, MSG_NOSIGNAL, NULL,
0) = 73

close(6)                                = 0

close(3)                                = 0

munmap(0x2a95557000, 4096)              = 0

exit_group(0)                           = ?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090918/6e338a63/attachment.html>
-------------- next part --------------
------------------------------------------------------------------------------
Come build with us! The BlackBerry® Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay 
ahead of the curve. Join us from November 9-12, 2009. Register now!
http://p.sf.net/sfu/devconf
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel


More information about the Developers mailing list