Patches for improved NEB control

Hendrik Baecker b00mer at gmx.net
Thu Oct 26 17:42:56 CEST 2006


Hi Bob,

this sounds really good for "advanced distributed monitoring".

Perhaps you could write a little more about what to do, if it is possible
at all, to include this in failover monitoring for redundancy purposes?

Hendrik

bobi at netshel.net wrote:
> Attached is a patch-set I would like some feedback on.
>
> The purpose of this patch is to allow Nagios the ability to delegate the
> execution of service checks to a NEB module.
>
> Why would we want to do this?  I'm glad you asked...
>
> The point is to allow Nagios to scale efficiently in large-scale
> environments by delegating service checks to multi-node "check" clusters. 
> That is, it facilitates the creation of a Nagios Service Check Cluster (or
> multiple independent clusters) that can be deployed in either one
> location or multiple locations.
>
> The benefits are:
>
> 1. It de-couples Service Check execution from Scheduling on the same box. 
> Sure, you can do this by setting up multiple Nagios instances that report
> their results passively back up to the "master" Nagios box, but that
> requires manually splitting up your configuration among multiple Nagios
> instances, setting up all of the passive result reporting, etc.
>
> In this scenario, you can keep your centrally-located master configuration
> file and have the service checks distributed to light-weight,
> geographically-dispersed service check clusters.
>
> 2. Scalability.  You can support more simultaneous service checks by
> adding more light-weight service check nodes incrementally.
>
> You can start with zero external nodes (i.e., all checks still executed by
> Nagios internally). Then add one node as your service check count
> increases.  Then gradually (or quickly) increase the node count, locally
> or remotely, as your service check count grows, and the system will scale
> appropriately.
>
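The zero-node case above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the check_cluster struct, pick_node() and delegate_or_decline() are hypothetical names, and round-robin is just one possible way to spread checks (the mail does not say how the real module chooses a node). It relies on the return-code mechanism described further down in the quoted mail: with no nodes configured the module declines, so Nagios runs the check itself.

```c
#include <stddef.h>

#define NEB_OK                     0
#define NEBERROR_CALLBACKOVERRIDE  206   /* value from the neberrors.h patch */

/* Hypothetical description of one check cluster. */
typedef struct {
	const char **nodes;     /* node names; may be NULL when empty */
	size_t node_count;      /* zero means "no external nodes yet" */
	size_t next;            /* round-robin cursor (assumed policy) */
} check_cluster;

/* Returns the next node in round-robin order, or NULL if there is none. */
const char *pick_node(check_cluster *c)
{
	const char *node;

	if (c == NULL || c->node_count == 0)
		return NULL;
	node = c->nodes[c->next];
	c->next = (c->next + 1) % c->node_count;
	return node;
}

/* Decides what the module callback would return for one check.
   With no node available it declines, so Nagios executes the check. */
int delegate_or_decline(check_cluster *c, const char **chosen)
{
	const char *node = pick_node(c);

	if (chosen != NULL)
		*chosen = node;
	return (node == NULL) ? NEB_OK : NEBERROR_CALLBACKOVERRIDE;
}
```

Growing the cluster then only means growing the node list; the decision logic stays the same.
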
> Anyway, it's not the ultimate, end-all, be-all, but we have found it helps
> us scale and manage Nagios efficiently in our large-scale,
> multi-datacenter environment.  The hope is that this will be considered as
> a potential part of the new Nagios architecture some day.
>
> For those who want to know how Nagios actually delegates service check
> execution to an external cluster via a NEB module, here are the high-level
> details:
>
> We have written a multi-threaded NEB module that registers a 
> NEBCALLBACK_SERVICE_CHECK_DATA callback and watches for the
> NEBTYPE_SERVICECHECK_INITIATE event.
>
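A minimal sketch of what such a callback might look like, not the actual module (which is not part of the patch set). To keep it compilable on its own, the constants and the check_event struct are stand-ins: only the value 206 comes from the patch, the other numeric values are assumed, and queue_for_cluster() is a hypothetical hand-off. A real module would include the Nagios headers and register the function through neb_register_callback().

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in values so the sketch compiles without the Nagios headers. */
#define NEB_OK                          0
#define NEBERROR_CALLBACKOVERRIDE       206  /* from the neberrors.h patch */
#define NEBCALLBACK_SERVICE_CHECK_DATA  1    /* assumed value */
#define NEBTYPE_SERVICECHECK_INITIATE   100  /* assumed value */
#define NEBTYPE_SERVICECHECK_PROCESSED  101  /* assumed value */

/* Hypothetical, trimmed-down stand-in for nebstruct_service_check_data. */
typedef struct {
	int type;
	const char *host_name;
	const char *service_description;
	const char *command_line;
} check_event;

static int queued_checks = 0;

/* Hypothetical hand-off; a real module would pass the check to its
   worker threads here. */
static void queue_for_cluster(const check_event *ev)
{
	queued_checks++;
	printf("queued %s/%s: %s\n",
	       ev->host_name, ev->service_description, ev->command_line);
}

/* Callback body: take over only when a check is about to be started,
   and let every other service check event pass through untouched. */
int handle_service_check_event(int callback_type, void *data)
{
	const check_event *ev = (const check_event *)data;

	if (callback_type != NEBCALLBACK_SERVICE_CHECK_DATA || ev == NULL)
		return NEB_OK;
	if (ev->type != NEBTYPE_SERVICECHECK_INITIATE)
		return NEB_OK;

	queue_for_cluster(ev);
	return NEBERROR_CALLBACKOVERRIDE;
}
```

Filtering on the event type matters because the same callback type also carries other service check events, which the module should only observe.
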
> It then takes each service check and distributes it across the network to
> multiple "worker" nodes in a cluster (via XML-RPC).  It also takes care of
> processing the check results, posting them to the internal Nagios result
> queue, plugin timeout conditions, etc.
>
> The way this works is that Nagios now checks the return code from NEB
> modules that are registered for the NEBCALLBACK_SERVICE_CHECK_DATA event.
>
> If the NEB module returns the "new" NEBERROR_CALLBACKOVERRIDE result code,
> Nagios "delegates" execution of the service check to the NEB module. 
> Otherwise, Nagios continues to execute the service check itself, as it
> normally does.
>
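The control flow described above can be modelled in a few lines, independent of the Nagios source. This is a simplified sketch under stated assumptions: make_callbacks() and run_service_check() are hypothetical stand-ins for neb_make_callbacks() and the check routine in checks.c, the callbacks live in a plain array rather than the real callback list, and observe_only() and take_over() are made-up example modules.

```c
#include <stddef.h>

#define NEB_OK                     0
#define NEBERROR_CALLBACKOVERRIDE  206   /* value from the neberrors.h patch */

typedef int (*neb_callback)(int callback_type, void *data);

/* Calls each registered callback in order and stops at the first one
   that asks to override. The last result seen is returned. */
int make_callbacks(neb_callback *callbacks, size_t count,
                   int callback_type, void *data)
{
	int cbresult = NEB_OK;
	size_t i;

	for (i = 0; i < count; i++) {
		cbresult = callbacks[i](callback_type, data);
		if (cbresult == NEBERROR_CALLBACKOVERRIDE)
			break;
	}
	return cbresult;
}

/* Returns 1 if the check would be executed locally, 0 if a module
   took it over. */
int run_service_check(neb_callback *callbacks, size_t count, void *svc)
{
	int neb_ret = make_callbacks(callbacks, count, 0, svc);

	if (neb_ret == NEBERROR_CALLBACKOVERRIDE)
		return 0;   /* delegated: skip the local execution */

	/* the normal local plugin execution would happen here */
	return 1;
}

static int observer_calls = 0;

/* Example module that only watches events. */
int observe_only(int callback_type, void *data)
{
	(void)callback_type;
	(void)data;
	observer_calls++;
	return NEB_OK;
}

/* Example module that claims every check. */
int take_over(int callback_type, void *data)
{
	(void)callback_type;
	(void)data;
	return NEBERROR_CALLBACKOVERRIDE;
}
```

One consequence of bailing out early is that callbacks registered after the overriding module never see the event, so the order of registration matters.
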
> So, the attached patch files enable this functionality.
>
> Note that this patch set does not include our multi-threaded NEB module
> (if you're interested in that, just e-mail me - it's meant to be open
> source.)  It just includes the patches to allow a NEB module to override
> service check execution.
>
> This should be a pretty straightforward patch, and doesn't modify any
> functionality in the absence of the broker. We just need it to expand the
> flexibility of what a NEB module can do.
>
> Thanks,
> Bob
>   
> ------------------------------------------------------------------------
>
> --- /home/icsrwi/proj/nagios-2.4-ORIG/base/broker.c	2005-12-23 12:31:35.000000000 -0700
> +++ broker.c	2006-08-16 11:25:51.597024488 -0600
> @@ -293,17 +293,18 @@
>  
>  
>  /* send service check data to broker */
> -void broker_service_check(int type, int flags, int attr, service *svc, int check_type, struct timeval start_time, struct timeval end_time, char *command, double latency, double exectime, int timeout, int early_timeout, int retcode, char *cmdline, struct timeval *timestamp){
> +int broker_service_check(int type, int flags, int attr, service *svc, int check_type, struct timeval start_time, struct timeval end_time, char *command, double latency, double exectime, int timeout, int early_timeout, int retcode, char *cmdline, struct timeval *timestamp){
>  	char *command_buf=NULL;
>  	char *command_name=NULL;
>  	char *command_args=NULL;
>  	nebstruct_service_check_data ds;
> +	int ret;
>  
>  	if(!(event_broker_options & BROKER_SERVICE_CHECKS))
> -		return;
> +		return NEB_OK;
>  	
>  	if(svc==NULL)
> -		return;
> +		return NEB_ERROR;
>  
>  	/* get command name/args */
>  	if(command!=NULL){
> @@ -339,12 +340,12 @@
>  	ds.perf_data=svc->perf_data;
>  
>  	/* make callbacks */
> -	neb_make_callbacks(NEBCALLBACK_SERVICE_CHECK_DATA,(void *)&ds);
> +	ret = neb_make_callbacks(NEBCALLBACK_SERVICE_CHECK_DATA,(void *)&ds);
>  
>  	/* free data */
>  	free(command_buf);
>  
> -	return;
> +	return ret;
>          }
>  
>  
> ------------------------------------------------------------------------
>
> --- /home/icsrwi/proj/nagios-2.4-ORIG/include/broker.h	2005-12-23 12:31:36.000000000 -0700
> +++ broker.h	2006-08-16 11:33:30.723588858 -0600
> @@ -187,7 +187,7 @@
>  void broker_ocp_data(int,int,int,void *,int,int,double,int,int,struct timeval *);
>  void broker_system_command(int,int,int,struct timeval,struct timeval,double,int,int,int,char *,char *,struct timeval *);
>  void broker_host_check(int,int,int,host *,int,int,int,struct timeval,struct timeval,char *,double,double,int,int,int,char *,char *,char *,struct timeval *);
> -void broker_service_check(int,int,int,service *,int,struct timeval,struct timeval,char *,double,double,int,int,int,char *,struct timeval *);
> +int broker_service_check(int,int,int,service *,int,struct timeval,struct timeval,char *,double,double,int,int,int,char *,struct timeval *);
>  void broker_comment_data(int,int,int,int,int,char *,char *,time_t,char *,char *,int,int,int,time_t,unsigned long,struct timeval *);
>  void broker_downtime_data(int,int,int,int,char *,char *,time_t,char *,char *,time_t,time_t,int,unsigned long,unsigned long,unsigned long,struct timeval *);
>  void broker_flapping_data(int,int,int,int,void *,double,double,double,struct timeval *);
> ------------------------------------------------------------------------
>
> --- /home/icsrwi/proj/nagios-2.4-ORIG/base/checks.c	2006-02-15 21:47:55.000000000 -0700
> +++ checks.c	2006-08-16 11:32:28.309124928 -0600
> @@ -109,6 +109,7 @@
>  	FILE *fp;
>  	int pclose_result=0;
>  	int time_is_valid=TRUE;
> +	int neb_ret;
>  #ifdef EMBEDDEDPERL
>  	char fname[512];
>  	char *args[5] = {"",DO_CLEAN, "", "", NULL };
> @@ -268,7 +269,11 @@
>  	/* send data to event broker */
>  	end_time.tv_sec=0L;
>  	end_time.tv_usec=0L;
> -	broker_service_check(NEBTYPE_SERVICECHECK_INITIATE,NEBFLAG_NONE,NEBATTR_NONE,svc,SERVICE_CHECK_ACTIVE,start_time,end_time,svc->service_check_command,svc->latency,0.0,0,FALSE,0,processed_command,NULL);
> +	neb_ret = broker_service_check(NEBTYPE_SERVICECHECK_INITIATE,NEBFLAG_NONE,NEBATTR_NONE,svc,SERVICE_CHECK_ACTIVE,start_time,end_time,svc->service_check_command,svc->latency,0.0,0,FALSE,0,processed_command,NULL);
> +
> +	/* check for override from module callback */
> +	if (neb_ret == NEBERROR_CALLBACKOVERRIDE)
> +		return;
>  #endif
>  
>  #ifdef EMBEDDEDPERL
> ------------------------------------------------------------------------
>
> --- /home/icsrwi/proj/nagios-2.4-ORIG/include/neberrors.h	2005-11-25 20:52:07.000000000 -0700
> +++ neberrors.h	2006-08-16 10:59:47.123913345 -0600
> @@ -50,6 +50,7 @@
>  #define NEBERROR_CALLBACKNOTFOUND   203     /* the callback could not be found */
>  #define NEBERROR_NOMODULEHANDLE     204     /* no module handle specified */
>  #define NEBERROR_BADMODULEHANDLE    205     /* bad module handle */
> +#define NEBERROR_CALLBACKOVERRIDE   206     /* callback overrides Nagios handling of event */
>  
>  
>  
> ------------------------------------------------------------------------
>
> --- /home/icsrwi/proj/nagios-2.4-ORIG/base/nebmods.c	2006-04-05 16:33:31.000000000 -0600
> +++ nebmods.c	2006-08-16 11:42:35.534806405 -0600
> @@ -548,9 +548,11 @@
>  #ifdef DEBUG
>  		printf("Callback type %d resulted in return code of %d\n",callback_type,cbresult);
>  #endif
> +		if (cbresult == NEBERROR_CALLBACKOVERRIDE)
> +			break;	/* Bail-out early on an override result */
>  	        }
>  
> -	return OK;
> +	return cbresult;
>          }
>  
>  
> ------------------------------------------------------------------------
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>   

