Problem with OCP_daemon in distributed environment

Michel van der Voort michel.vdv at wxs.nl
Wed Aug 17 20:29:45 CEST 2011


Hello again Craig,

Thanks once more and yes you're making sense.
But that is also why I can't pinpoint what is going wrong: it should
simply work, especially because nothing changes on the central server,
except perhaps the volume of check results coming from the OCP_daemon
machine.
We do indeed have a number of remote machines, all running their checks
locally and sending the results to our central server via nsca/nscad, and
that is how the machine I tried to configure with OCP_daemon was (and now
is again) set up.
I know, of course, that OCP_daemon uses the underlying (and unchanged)
send_nsca config and binary, and all of that works well.
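(For reference, the send_nsca leg itself can be smoke-tested by hand by
pushing a dummy result through it; the host name, service description and
paths below are only placeholders:

    printf "someprobe\tSome Service\t0\tOK - manual test\n" | \
        /usr/local/nagios/bin/send_nsca -H central.example.com \
            -c /usr/local/nagios/etc/send_nsca.cfg

If that shows up as a passive result on the central server, the
send_nsca/nsca path is healthy.)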
The latencies I had before on this machine are what made me experiment
with OCP_daemon in the first place, because a lag of about 10 - 15 minutes
before messages/performance data appeared on the central server was
unacceptable.
This also caused WARNINGs on certain checks on the OCP machine, because
those checks have to run frequently enough to avoid counter overflows on
the 64-bit counters of some network devices we monitor.

Also, debugging nsca on the central receiving end shows everything working
fine: data is coming in from the OCP_daemon machine as well as from the
other remote machines that still use nsca/nscad.
The only thing that stops working is the central server's own ACTIVE
checks, which no other Nagios machine even knows about.

I guess I have to do even more research and debugging on the central server.
I would expect that machine to show high CPU, memory or I/O load if
something like too-fast incoming data were the issue, but there are no
such indications.
The only symptom is that after 2 hours all local checks have a last
execution timestamp that is 2 hours old, and the check_ processes really
are no longer being launched.
I've switched back to standard obsessing with send_nsca on the
OCP_daemon machine and restarted Nagios on the central server, and
everything is working again, though unfortunately with high check
intervals and latency on the OCP_daemon machine.
The only thing I noticed was a higher number of buffer slots in use for
the external command file, which process_perfdata reads from and nscad
writes to.
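(If it helps anyone, that buffer usage can be watched from the shell;
assuming a Nagios 3 nagiostats binary, something like

    nagiostats -c /usr/local/nagios/etc/nagios.cfg \
        --mrtg --data=TOTCMDBUF,USEDCMDBUF,HIGHCMDBUF

prints the total, currently used and high-water-mark slot counts, which
makes it easy to see whether the buffer ever actually fills up.)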

For now, thanks a lot.
I'm not really familiar with closing a topic on the Nagios Users List but I
will try to.
Also, if and when I find out more I will inform you.

Best regards,

Michel

-----Original Message-----
From: Craig Stewart [mailto:Craig.Stewart at corp.xplornet.com] 
Sent: Wednesday, 17 August 2011 14:10
To: Nagios Users List
CC: michel.vdv at wxs.nl
Subject: Re: [Nagios-users] Problem with OCP_daemon in
distributed environment

Michel,

Okay, I understand now.

So, if I understand correctly: when you were using the obsess method,
everything was working fine from the central server's point of view, but
when you moved one remote unit from the obsess method to OCP_daemon, the
central server stopped doing all active checks?

The way I have it set up here for my central/probe configuration is that
the central server accepts passive checks through the nscad process.  My
remote servers send their results in either via the OCP_daemon (which calls
send_nsca) or a custom obsess script.  There are no changes to my
central server.
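For what it's worth, the central side of that is nothing exotic: nscad
just writes into the external command file.  A minimal nsca.cfg on the
central would look roughly like this (port, paths and password are only
illustrative):

    server_port=5667
    command_file=/usr/local/nagios/var/rw/nagios.cmd
    password=not-the-real-one
    decryption_method=1

together with check_external_commands=1 and accept_passive_service_checks=1
in the central nagios.cfg.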

So, unless you are doing something strange, you should be able to get it
going and executing active checks as well as accepting passive checks on
the central.  The method the probe uses, as long as it's consistent with
the way the central server picks up checks (send_nsca/nscad in my case),
is independent of the central server.  If you get this working,
switching the probe from the obsess method to the OCP_daemon method
should not affect the central server, or even require a restart.

Am I making any sense here or have I confused the issue?

Craig
--
Craig Stewart
Systems Integration Analyst
Craig.Stewart at corp.xplornet.com
Xplornet - Broadband, Everywhere

On 08/16/2011 05:02 PM, michel.vdv at wxs.nl wrote:
> Hello Craig,
>  
> First of all thanks for the fast response.
> Maybe I need to explain a bit more clearly why ACTIVE checks are
> happening on the central server.
> We have a distributed setup with a central machine in the DMZ, reachable
> by all the remote Nagios machines we have out there.
> One of those is the LAN machine I mentioned, where OCP_daemon was set up
> today.
> The central Nagios machine in the DMZ should/must perform active checks
> of all our equipment in the same DMZ; the other hosts only send passive
> data.
> The DMZ machine cannot perform ACTIVE checks on the services monitored
> by one or more of the remote machines.
> So this is why there is a problem when the central server does not
> perform its own checks.
>  
> I've been experimenting with reaper frequencies on the central server
> because I saw 'reaper frequency exceeded' messages in the nagios.debug
> (-1) output.
> Those messages are gone now, but the result is still the same.
> I've also lowered all the template-related check_intervals on the
> OCP_daemon remote machine, but that does not help either.
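> (For anyone reading along: the reaper knobs in question are set in
> nagios.cfg; the directive names below are the Nagios 3 ones, and the
> values are only examples, not necessarily what we run:
>
>     check_result_reaper_frequency=5
>     max_check_result_reaper_time=30
> )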
>  
> If you have any more suggestions, please let me know.
>  
> Regards,
>  
> Michel
> ------------------------------------------------------------------------
> *From:* Craig Stewart [mailto:Craig.Stewart at corp.xplornet.com]
> *Sent:* Tue 16-8-2011 21:47
> *To:* Nagios Users List
> *CC:* michel.vdv at wxs.nl
> *Subject:* Re: [Nagios-users] Problem with OCP_daemon in distributed
> environment
> 
> Michel,
> 
> I just did the same thing for my setup and I didn't see this issue.
> That being said, I don't *want* the central master to execute service
> checks at all unless it's stale.
> 
> What may be happening is that the remote passive check may be getting
> inserted while the system is waiting to execute the next check.  This is
> probably resetting the clock, as it were, and the countdown starts over.
> 
> For example:
> 
> - NOW is an arbitrary point in time.
> - Nagios schedules the check to be executed at NOW + 5 min. (recheck
> interval)
> - The passive check comes in at NOW + 3 min.  Nagios resets the clock to
> NOW + 3 min + check interval.
> 
> If the remote is submitting results more frequently than the
> central's recheck interval, I can see this happening.  The clock never
> runs out unless the remote system fails to submit a check.
> 
> A couple of things to check are the check intervals on both the central
> and the probe; and if you can tolerate the hit, shut down the probe and
> see if the central server starts executing checks on its own.
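> If what you want on the central is "only run the active check when the
> probe has gone quiet", freshness checking is the usual way to express
> that; a rough sketch of such a service definition (host/service names,
> the threshold and the check_dummy fallback are only illustrative, and it
> also needs check_service_freshness=1 in nagios.cfg):
>
>     define service{
>         use                     generic-service
>         host_name               remote-lan-host
>         service_description     Some Passive Service
>         active_checks_enabled   0
>         passive_checks_enabled  1
>         check_freshness         1
>         freshness_threshold     900
>         check_command           check_dummy!2!"no fresh passive result"
>         }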
> 
> I may be out in left field as well.
> 
> Cheers!
> 
> Craig
> --
> Craig Stewart
> Systems Integration Analyst
> Craig.Stewart at corp.xplornet.com
> Xplornet - Broadband, Everywhere
> 
> On 08/16/2011 04:22 PM, michel.vdv at wxs.nl wrote:
>> Dear readers,
>> 
>> I have a strange problem related to the use of OCP_daemon.
>> I've implemented this today on a "remote" nagios machine responsible for
>> monitoring our LAN hosts.
>> Until now, all messages and performance data were sent from that machine
>> to our central Nagios machine via obsess_over_hosts and
>> obsess_over_services.
>> But because the large number of services on the remote host, combined
>> with relatively short check_interval periods, caused high service and
>> host check latencies, I started looking for an alternative and read
>> about OCP_daemon.
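>> (For context, that obsess setup is the standard one from the Nagios
>> docs as far as I know: obsess_over_services plus an ocsp_command that
>> shells out to send_nsca for every result.  The command name and script
>> path below are only illustrative:
>>
>>     # nagios.cfg on the remote machine
>>     obsess_over_services=1
>>     ocsp_command=submit_service_check_result
>>
>>     # commands.cfg
>>     define command{
>>         command_name    submit_service_check_result
>>         command_line    /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
>>         }
>> )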
>> I followed the install instructions, and sending data via OCP_daemon
>> works fine and very fast; the remote Nagios machine's latencies also
>> stay low.
>> However, the central server keeps processing all passive service and
>> host check results (including those from other send_nsca-based hosts)
>> but no longer executes its own ACTIVE checks.
>> As soon as I stop Nagios on the remote monitor and restart Nagios on
>> the central server, it starts executing ACTIVE checks again.
>> The load on both servers has remained about the same since switching to
>> OCP_daemon, and the only thing I noticed is that the number of buffer
>> slots used for the external command file (nagios.cmd) on the central
>> server reaches rather higher values than before, but no more than
>> 30 - 40% of the available 4096 slots.
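>> (That ceiling, if I read the docs correctly, comes from the central
>> nagios.cfg:
>>
>>     external_command_buffer_slots=4096
>>
>> so 30 - 40% usage should still leave plenty of headroom.)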
>> 
>> Please advise me.
>> 
>> Michel
>> 

