High cost for Nagios to spawn processes at Solaris 10 / Sun Fire T1000?

Steffen Poulsen step at tdc.dk
Sat Sep 29 22:16:36 CEST 2007


Hi again,

Tonight one of our distributed servers took a break, and after it came back up, I noticed on our master server that everything seemed to have halted there.

The distributed server is responsible for reporting ~3500 of our checks, and we have check_freshness enabled for these on the master server.

So, what seems to have happened is that the freshness checks started to time out, and each timeout triggered a check_dummy that set an UNKNOWN state. This particular server is apparently only able to do this at about 1 to 4 checks/second -> unusable platform until things stabilize again. .. Which will probably never happen, as it will spend too much time doing service alerts and too little time processing external commands (NSCA passive check results) :-/
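For context, the freshness setup on the master looks roughly like this (a sketch; the host name, service name, and threshold here are illustrative, not our exact config):

```
define service{
        host_name               somehost           ; illustrative
        service_description     some-passive-check ; illustrative
        active_checks_enabled   0                  ; results arrive via NSCA
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     900                ; seconds without a fresh result
        ; run when the result goes stale; 3 = UNKNOWN
        check_command           check_dummy!3!"No fresh result received"
        }
```

With ~3500 such services going stale at once, the master forks one check_dummy per service, which is exactly the process-spawn storm described above.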

We are running with retention data enabled, so we figured the only way out of this situation was to delete the retention data file, to allow for a "fresh" start from PENDING.
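The cleanup itself is only a couple of commands (a sketch; the paths assume a default /usr/local/nagios install and an init script, so adjust to your layout):

```shell
# (sketch) clear retention state so Nagios restarts everything from PENDING
NAGVAR=/usr/local/nagios/var     # assumed install path

/etc/init.d/nagios stop          # stop first, or shutdown rewrites the file
rm -f "$NAGVAR/retention.dat"
/etc/init.d/nagios start
```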

So, in short - _everything_ that shells out, is not parallelized, and hits more than a few checks at a time appears to break our setup on this platform as it stands?

I guess this T1000 platform is rather special - it is a "4 cores, 32 threads" kind of thing - could it be that all of this parallelization has the exact opposite effect of what we were hoping for? Possibly a synchronization issue on the process spawn?

I guess our best option is to go x86 ASAP unless someone can enlighten us on this issue. If anybody else is running Nagios on similar hardware without issues, please speak up.

Best regards,
Steffen Poulsen

BTW: We solved the performance data issue by having Nagios write the perfdata to a file, as suggested, and pointing a simple Perl tail script at it:

#!/usr/bin/perl

use strict;
use warnings;
use File::Tail;
use IO::Socket::INET;

my $debug = 0;
my $logFileName = "/usr/local/nagios/var/service-perfdata";
my $nagServ = "11.11.11.11";
my $nagPort = "5667";

# Follow the perfdata file as Nagios appends to it
my $file = File::Tail->new(name => $logFileName, maxinterval => 1);
while (defined(my $line = $file->read)) {
    print "Received: $line\n" if $debug;
    my ( $dummy1, $dummy2, $host, $service, $dummy3, $dummy4, $perf, $perfdata )
        = split( /\t/, $line );

    # Forward host, service and perfdata fields over UDP
    my $send = "$host\t$service\t$perf\t$perfdata";
    my $MySocket = IO::Socket::INET->new(
        PeerPort => $nagPort,
        Proto    => 'udp',
        PeerAddr => $nagServ,
    );
    $MySocket->send($send);
    $MySocket->close();
}

> -----Original message-----
> From: nagios-devel-bounces at lists.sourceforge.net 
> [mailto:nagios-devel-bounces at lists.sourceforge.net] On behalf 
> of Hendrik Bäcker
> Sent: 27 September 2007 14:30
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Extremely bad performance when 
> enabling process_performance_data on Solaris 10?
> 
> Andreas Ericsson wrote:
> > 
> > I haven't run into it, but I would solve it with a NEB-module that 
> > sends the performance data to a graphing server. It's really quite 
> > trivial to do, and a send(2) call generally finishes quickly enough.
> > 
> 
> Might be wrong, but a long time ago I invested some 
> time in writing a NEB module that would call send_nsca to get 
> rid of the blocking ocsp command.
> I found out that even a NEB module blocks, too.
> 
> Just want to say that you should take care how much time your 
> NEB module spends doing something.
> 
> But as Andreas said: a send() should be faster than the 
> popen() that I did in the past.
> 
> Just my 2 Cents.
> 
> Hendrik
> 
