NDO utils bug/explanation [Solved]

Michael Friedrich michael.friedrich at univie.ac.at
Tue Sep 22 09:50:47 CEST 2009


Hi Marco,

you are very welcome, and thanks for your reply. It was nice digging 
deeper into the code for this use case, and I'm sure it will help me 
with the development of Icinga IDOUtils, both in removing the blocking 
behavior and in enhancing the debug output (alongside rewriting the db 
handling code for other RDBMS like Postgres, Oracle, SQLite).

Kind regards,
Michael

Frassinelli, Marco wrote the following on 22.09.2009 09:24:
>
> Hi Michael
>
>  
>
> Thank you very much for the explanations. I finally found the problem: 
> it was a Nagios issue, too many reloads were interrupting the data flow.
>
> Now everything works very well.
>
> Bye
>
> Marco
>
>  
>
> *From:* Michael Friedrich [mailto:michael.friedrich at univie.ac.at]
> *Sent:* Sunday, 20 September 2009 2:34
> *To:* Nagios Developers List
> *Subject:* Re: [Nagios-devel] NDO utils bug/explanation
>
>  
>
> Hi,
>
> Frassinelli, Marco wrote:
>
> Hello
>
> The process of starting ndo2db and then Nagios makes sure that there 
> is current data within the DB. If there is outdated data within the 
> DB, it needs to be removed before Nagios sends any new data. So the 
> trimming of those table entries at the beginning is truly intentional 
> (the so-called pre-launch state, where the if condition matches). 
> If ndo2db fails for some reason, that data will remain within the 
> database and will then be removed during the next start.
>
>                 Ok, when the parent ndo2db process starts. But not 
> when the child starts.
>
> Considering the fact that the process runs daemonized and forks a 
> child for each ndomod client connection, that is truly wrong. See the 
> NDOUtils documentation page 8:
> http://www.scribd.com/doc/17608352/Nagios-NDOUtils
>
> Going deeper into the code in ndo2db.c:main, you will see that 
> after the initialization the standalone daemon condition matches.
>
>         /* standalone daemon... */
>         else{
>
>                 /* create socket and wait for clients to connect */
>                 if(ndo2db_wait_for_connections()==NDO_ERROR)
>                         return 1;
>                 }
>
> This function opens the socket, whether TCP or Unix. Then the parent 
> forks a child to handle client data (fork() returning 0 = everything's 
> OK and we are in the child, -1 = error).
>
>                 /* fork... */
>                 new_pid=fork();
>
>                 switch(new_pid){
> [...]
>                 case 0:
> #endif
>                         /* child processes data... */
>                         ndo2db_handle_client_connection(new_sd);
>
>                         /* close socket when we're done */
>                         close(new_sd);
> #ifndef DEBUG_NDO2DB
>                         return NDO_OK;
>                         break;
> [...]
>                         }
>
> The parent runs in a while loop, and if the child handling the client 
> connection has died, it forks a new one for the next connection.
> The child then reads the data from the socket within 
> ndo2db_handle_client_connection. If there's nothing on the socket, the 
> client connection was lost and the while(1) loop terminates (as it also 
> does on errors), which means the child terminates too. The child 
> performs the memory cleanup and db disconnect, and then the process of 
> the parent accepting a new client connection and forking a child for 
> each client connection starts again.
> Just look that up in the code, it's quite nicely commented.
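
To make the fork-per-connection flow described above concrete, here is a minimal sketch. It is not the actual ndo2db code: handle_client and fork_child_for are hypothetical names, and unlike a real daemon the parent here waits for the child so that its exit code can be inspected.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for ndo2db_handle_client_connection():
 * drain the descriptor until the peer closes it or a read fails. */
static int handle_client(int sd) {
    char buf[256];
    ssize_t n;
    while ((n = read(sd, buf, sizeof buf)) > 0) {
        /* a real handler would parse the received data here */
    }
    return n == 0 ? 0 : 1;   /* 0: peer disconnected, 1: read error */
}

/* Fork one child for an accepted connection.
 * Returns the child's exit code, or -1 if fork/waitpid failed. */
static int fork_child_for(int new_sd) {
    pid_t pid = fork();
    if (pid < 0) {           /* fork failed */
        close(new_sd);
        return -1;
    }
    if (pid == 0) {          /* child: process data, then exit */
        int rc = handle_client(new_sd);
        close(new_sd);
        _exit(rc);
    }
    close(new_sd);           /* parent does not need the client socket */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this sketch every connection gets a fresh child, so any per-connection state starts from scratch each time.
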
>
>                 /* check for completed lines of input */
>                 ndo2db_check_for_client_input(&idi,&dbuf);
>
> This function also decides whether the current client needs to be 
> disconnected. So there are 3 main conditions that can lead to the 
> termination of the child:
>
> * socket read fails with -1 (cf. EAGAIN/EINTR) -> 
> http://linux.die.net/man/2/read
> * socket read returns nothing (connection lost)
> * client input uses the wrong protocol (idi->disconnect_client)
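
The three termination conditions listed above can be sketched as a small read loop. This is only an illustration under assumptions: the names (stop_reason, conn_state, check_input, client_loop) are hypothetical, the "rejected" rule is a made-up placeholder for the protocol check, and retrying on EINTR is a choice made for this sketch, not necessarily what ndo2db does.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

enum stop_reason { STOP_READ_ERROR, STOP_CONNECTION_LOST, STOP_CLIENT_REJECTED };

/* Hypothetical per-connection state; only the flag discussed above. */
struct conn_state { int disconnect_client; };

/* Placeholder input check: rejects the client when a chunk starts
 * with '!'. The real check compares protocol versions instead. */
static void check_input(struct conn_state *st, const char *buf, ssize_t n) {
    if (n > 0 && buf[0] == '!')
        st->disconnect_client = 1;
}

/* Read until one of the three stop conditions is met. */
static enum stop_reason client_loop(int sd) {
    struct conn_state st = {0};
    char buf[512];
    for (;;) {
        ssize_t n = read(sd, buf, sizeof buf);
        if (n == -1) {
            if (errno == EINTR)
                continue;                 /* assumption: retry interrupted reads */
            return STOP_READ_ERROR;       /* condition 1: read failed */
        }
        if (n == 0)
            return STOP_CONNECTION_LOST;  /* condition 2: nothing left to read */
        check_input(&st, buf, n);
        if (st.disconnect_client)
            return STOP_CLIENT_REJECTED;  /* condition 3: input check failed */
    }
}
```

Whichever reason is returned, the caller (the child) would then clean up and exit.
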
>
> If a line has been read from the socket, it is passed to 
> ndo2db_handle_client_input, where the data frame is analyzed. If the 
> client's protocol (ndomod) doesn't match the ndo2db protocol, the 
> client will be disconnected and its data ignored.
>
>                         /* client is using wrong protocol version, bail out here... */
>                         if(idi->protocol_version!=NDO_API_PROTOVERSION){
>                                 syslog(LOG_USER|LOG_INFO,"Error: Client protocol version %d is incompatible with server version %d.  Disconnecting client...",idi->protocol_version,NDO_API_PROTOVERSION);
>                                 idi->disconnect_client=NDO_TRUE;
>                                 idi->ignore_client_data=NDO_TRUE;
>                                 return NDO_ERROR;
>                                 }
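
For readers without the source at hand, a self-contained version of this check might look like the sketch below. All names (sketch_idi, sketch_check_protocol, the SKETCH_* constants) are hypothetical, the version value 2 is only a placeholder, and the message goes to stderr instead of syslog to keep the example simple.

```c
#include <stdio.h>

#define SKETCH_PROTOVERSION 2   /* placeholder, not taken from the NDOUtils headers */
#define SKETCH_OK 0
#define SKETCH_ERROR (-1)

/* Hypothetical subset of the per-connection input state. */
struct sketch_idi {
    int protocol_version;
    int disconnect_client;
    int ignore_client_data;
};

/* Flag the client for disconnect when its protocol version differs. */
static int sketch_check_protocol(struct sketch_idi *idi) {
    if (idi->protocol_version != SKETCH_PROTOVERSION) {
        fprintf(stderr,
                "Error: client protocol version %d is incompatible with "
                "server version %d. Disconnecting client...\n",
                idi->protocol_version, SKETCH_PROTOVERSION);
        idi->disconnect_client = 1;   /* read loop should stop */
        idi->ignore_client_data = 1;  /* drop anything still buffered */
        return SKETCH_ERROR;
    }
    return SKETCH_OK;
}
```
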
>
> Other data will be read, parsed and saved by type in 
> idi->buffered_input ... If the data section is complete, as indicated 
> by data_type==NDO_API_ENDDATA, then ndo2db_end_input_data will be 
> called, in which the db handling starts, e.g.
>
>         case NDO2DB_INPUT_DATA_PROGRAMSTATUSDATA:
>                 result=ndo2db_handle_programstatusdata(idi);
>                 break;
>
> The values for the db query will be read 
> (dbhandler.c:ndo2db_handle_programstatusdata) and then the queries 
> are sent to the db (currently MySQL only) by calling ndo2db_db_query.
>
>
> But the real question is: where is ndo2db_handle_processdata called, 
> which performs the table trimming if the program is starting up? OK, 
> that's the same data handling.
>
>         /* realtime Nagios data */
>         case NDO2DB_INPUT_DATA_PROCESSDATA:
>                 result=ndo2db_handle_processdata(idi);
>                 break;
>
> Looking at the way we got here, it's quite simple to see that only the 
> child will call this function. So every time the child dies and gets 
> forked again, the condition of starting up with outdated realtime data 
> matches, and consequently the tables are trimmed.
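
The effect described above can be shown with a toy model. This is a sketch under assumptions, not NDOUtils code: the event values, the function names and the trim_count counter (standing in for the database tables) are all hypothetical. The point is only that the handler keeps no memory of earlier children, so each new session that receives a prelaunch event trims again.

```c
/* Hypothetical event types, not the Nagios NEBTYPE defines. */
enum sketch_event { SKETCH_PRELAUNCH, SKETCH_OTHER };

/* Stand-in for the database: counts how often tables would be trimmed. */
static int trim_count = 0;

/* Stand-in for the process-data handler: trims on every prelaunch event. */
static void sketch_handle_processdata(enum sketch_event type) {
    if (type == SKETCH_PRELAUNCH)
        trim_count++;
}

/* One child lifetime: handle the events of a single client connection. */
static void sketch_child_session(const enum sketch_event *events, int n) {
    for (int i = 0; i < n; i++)
        sketch_handle_processdata(events[i]);
}
```

Running two sessions that both start with SKETCH_PRELAUNCH leaves trim_count at 2, which mirrors the repeated trimming after each refork.
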
>
> -----------------------
> Looking a bit more at the client side with ndomod, the data comes from
>
>                         /* realtime data */
>                         case NDO_API_PROCESSDATA:
>                                 idi->current_input_data=NDO2DB_INPUT_DATA_PROCESSDATA;
>                                 break;
>
> which is generated in ndomod.c:ndomod_broker_data
>
>         /* handle the event */
>         switch(event_type){
>
>         case NEBCALLBACK_PROCESS_DATA:
>
> and written to datasink
>
>         /* write data to sink */
>         if(write_to_sink==NDO_TRUE)
>                 ndomod_write_to_sink(dbuf.buf,NDO_TRUE,NDO_TRUE);
>
> -----------------------
> The conditional match for NEBTYPE_PROCESS_PRELAUNCH comes directly 
> from Nagios and the defines for the Broker module. This type will be 
> sent to the broker after loading all neb modules.
>
> It seems that your Nagios version does not change this type, or it 
> simply reloads the broker module every 60 seconds. It could be that 
> the Nagios process is restarted for whatever reason, since the do-while 
> loop runs while sigrestart==TRUE && sigshutdown==FALSE.
>
> But I am not very familiar with that, that's just a guess. The only 
> thing is that the ndomod/ndo2db will receive the 
> NEBTYPE_PROCESS_PRELAUNCH and then do the work. So the conclusion 
> would be that they're working fine.
>
> The only other thing I could imagine is that the data written to/read 
> from the socket (TCP or Unix) is truncated. The defines for the 
> NEBTYPEs only differ by one or two bits. That could lead to errors, 
> but every 60 seconds is kind of weird....
>
>
>                 I am referring to the code in visualization software, 
> not ndo.
>
> Then you should point exactly to it, as I am not aware of what you 
> meant or which codebase you are using. Furthermore, you should mention 
> your Nagios and NDOUtils versions, operating system and architecture.
>
>
>
>                 The other strange thing is that ndo2db reforks exactly 
> every 60 seconds. Where in the code can I find this interval? Why does 
> the child ndo2db exit?
>
> See above, it seems that the event broker/ndomod doesn't send any data 
> (or the socket is broken) and then a refork is performed.
>
> How did you do that trace?
>
> Kind regards,
> Michael
>
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
> Come build with us! The BlackBerry® Developer Conference in SF, CA
> is the only developer event you need to attend this year. Jumpstart your
> developing skills, take BlackBerry mobile applications to market and stay 
> ahead of the curve. Join us from November 9-12, 2009. Register now!
> http://p.sf.net/sfu/devconf
> ------------------------------------------------------------------------
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>   

-- 
DI (FH) Michael Friedrich
michael.friedrich at univie.ac.at
Tel: +43 1 4277 14359

Vienna University Computer Center
Universitaetsstrasse 7 
A-1010 Vienna, Austria  
