How to reduce a very high latency number

Trask trasko at gmail.com
Tue May 23 09:50:03 CEST 2006


On 5/22/06, srunschke at abit.de <srunschke at abit.de> wrote:
> nagios-users-admin at lists.sourceforge.net wrote on 17.05.2006 20:09:16:
>
> To me this is obviously a performance issue related to hardware.
> Your machines have way too few RAM. It is totally not possible to
> run 1800 checks on a 512MB machine in a timely manner.
>

I figured this out this past Saturday.  It is not a hardware
limitation.  I was seeing neither significant load nor excessive
memory use.  No configuration change I made seemed to have any
appreciable effect on the latency times I was getting.  I ended up
running "top" with 1 second intervals and just watching it for a
while.  I noticed that sometimes there would be a good number of
nagios processes (20, 30, 40 or so), but the majority of the time
there were only 2, 3 or 4 processes.  Although I do not know exactly
*why* this was happening, it turns out that during the time when there
were 2-4 processes running, 2 of them were always the
"submit_passive_check" script and "send_nsca".  It appears that this
is being done serially (i.e. not in parallel) and ends up blocking
subsequent checks until they are done.  I would see these 2 processes
running (with steadily increasing PIDs) for up to a minute and then a
short-lived (4-5 seconds) "explosion" of nagios processes
(service/host checks).  After this flurry of activity, it would be
another 60 seconds or so of just 2-4 processes.
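
For anyone who wants numbers rather than eyeballing "top", here is a
rough sketch of the same observation as a shell loop.  The function
names count_procs and watch_procs are made up for this example, and it
assumes a "ps" that accepts "-e -o comm=".  Note that some systems
truncate long command names in that column, so a long name such as
"submit_passive_check" may not match exactly.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post): sample how many
# processes with a given command name exist, once per second.

count_procs() {
    # $1: command name to count (exact match on the command-name column)
    ps -e -o comm= 2>/dev/null |
        awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

watch_procs() {
    # $1: command name; $2: number of one-second samples to take
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(count_procs "$1")"
        i=$((i + 1))
        sleep 1
    done
}

# Example call (commented out so that loading this file has no side effects):
# watch_procs nagios 120
```

Long runs of small counts broken up by short bursts would match the
pattern described above.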

I resolved this problem by changing my "submit_passive_check" script.
Below are some sample scripts, both old and new.  The short of it is
like this:  Previously, the "submit_passive_check" script did a printf
of the data in the appropriate format and piped it to the "send_nsca"
command (in a shell script).  I have eliminated this bottleneck by
having "submit_passive_check" redirect its output to a named pipe and
then having another script feed "send_nsca" with that data as it comes
in to the named pipe.
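
The hand-off through a named pipe can be tried in isolation.  This is
a minimal sketch only: "cat" writing to a file stands in for
"send_nsca", and the temporary paths and the sample host/service
values are invented for the demo.

```shell
#!/bin/sh
# Demo of the named-pipe hand-off.  Assumption: `cat` into a file plays
# the role of send_nsca; host/service values are made-up sample data.
workdir=$(mktemp -d)
fifo="$workdir/demo_fifo"
out="$workdir/received.txt"
mkfifo "$fifo"

# Reader: blocks until a writer opens the FIFO, then copies everything
# it receives until the writer closes its end.
cat "$fifo" > "$out" &
reader_pid=$!

# Writer: same tab-separated format the passive-check script produces.
printf "%s\t%s\t%s\t[%s] %s\n" "web1" "HTTP" "0" "nag1" "HTTP OK" > "$fifo"

wait "$reader_pid"
received=$(cat "$out")
printf '%s\n' "$received"
rm -rf "$workdir"
```

The writer only has to get its line into the pipe.  How long the
consumer takes to forward it no longer holds up the writer.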

Latency times have dropped from the 600-700 seconds to 0.2 seconds on
the worst server and from 45-55 seconds to 0.06 on the 2nd to worst.
That's more like it!

Below are a few scripts w/ notes as to what each one is.  Thanks to
everyone who offered help.

~trask



-------------------------------------------------
Note that I have stripped a lot of the comments to keep things short
and I have made little edits here and there for the sake of clarity
(or to protect the innocent) without re-testing these scripts.  <usual
disclaimers here>.  I hope someone finds these useful.

-------------------------------------------------
Old "submit_passive_check" script (nearly identical to the example in
the docs).  This script just formats the data from nagios in the
proper way and pipes it to the "send_nsca" program to send the results
to the central server.

#!/bin/sh
# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
    OK)
        return_code=0
    ;;
    WARNING)
        return_code=1
    ;;
    CRITICAL)
        return_code=2
    ;;
    UNKNOWN)
        return_code=-1
    ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

# $1: host
# $2: srvc
# $4: which nagios server this is coming from (to aid interface in
#     central server)
# $5: output from service check

/usr/bin/printf "%s\t%s\t%s\t[%s] %s\n" "$1" "$2" "$return_code" "$4" "$5" \
  | /usr/local/nagios/bin/send_nsca -H nag4 -p 5669 \
      -c /usr/local/nagios/etc/send_nsca.cfg


-------------------------------------------------
New "submit_passive_check" script.  This redirects the data to a named
pipe -- it is much quicker to do this than wait for send_nsca to
return (or whatever the hang up is with using the above script).  It
is essentially the same as the above script, except for the 3 lines
below.  It perhaps would be faster using an embedded perl script, but
from my results it is clearly fast enough.


NSCAPIPE=/usr/local/nagios/var/rw/send_through_nsca

if [ ! -p "$NSCAPIPE" ]; then exit ; fi

/usr/bin/printf "%s\t%s\t%s\t[%s] %s\n" "$1" "$2" "$return_code" "$4" "$5" \
  > "$NSCAPIPE"

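Since only the changed lines are shown, here is one way the pieces
might fit together.  It is a sketch assembled from the two fragments,
not the exact script in use.  The function names state_to_code and
submit_passive_check, the NSCAPIPE environment override and the
non-zero return when the pipe is missing are all assumptions added for
illustration.

```shell
#!/bin/sh
# Sketch: the old script's state mapping combined with the FIFO write.
# Assumption: NSCAPIPE may be overridden from the environment; the
# default is the path used in the post.
NSCAPIPE=${NSCAPIPE:-/usr/local/nagios/var/rw/send_through_nsca}

state_to_code() {
    # $1: state string from nagios
    case "$1" in
        OK)       printf '%s\n' "0" ;;
        WARNING)  printf '%s\n' "1" ;;
        CRITICAL) printf '%s\n' "2" ;;
        *)        printf '%s\n' "-1" ;;  # UNKNOWN or unrecognised, as in the old script
    esac
}

submit_passive_check() {
    # $1 host, $2 service, $3 state string, $4 originating nagios
    # server, $5 output from the service check
    # Give up if the FIFO is missing (returning 1 is this sketch's choice).
    [ -p "$NSCAPIPE" ] || return 1
    printf "%s\t%s\t%s\t[%s] %s\n" \
        "$1" "$2" "$(state_to_code "$3")" "$4" "$5" > "$NSCAPIPE"
}

# Example call, as nagios might invoke a script built around this:
# submit_passive_check "$1" "$2" "$3" "$4" "$5"
```

One thing to keep in mind: if the FIFO exists but nothing has it open
for reading, the write blocks until a reader opens it, so the listener
needs to be running.
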
-------------------------------------------------
Perl script I called "nsca_listener.pl".  I am a little wary of
posting it because it is likely not too solid (I haven't done much
testing), but it has been working for about 14 hours now and it hasn't
made any zombies or otherwise misbehaved.  All it does is read data
from the named pipe "/usr/local/nagios/var/rw/send_through_nsca" and
pipe it to the "send_nsca" program.  Initially it did this serially,
but that was only able to send results out slightly faster than 1 per
second.  Using a fork I sped it up considerably, but I have not added
much in the way of sanity checking.  You can switch it back to a
serialized version by commenting out the &forkSendNSCA call and
uncommenting the 3 preceding lines.  Notes: 1) I totally ripped off
the fork code from some tutorial online; 2) uid and gid 5668 in the
chown call is the nagios user; 3) comment out "Proc::Daemon::Init;" to
get it to run in the foreground; 4) lastly, if you happen to have a
file named the same as the named pipe this script uses, it will delete
it and stick a named pipe there without asking you.


#!/usr/bin/perl

use Proc::Daemon;
use POSIX qw(setsid);
use strict;


my $NDIR      = "/usr/local/nagios";
my $FIFO      = "$NDIR/var/rw/send_through_nsca";
my $SEND_NSCA = "$NDIR/bin/send_nsca -n -H nag4 -p 5669 "
              . "-c /usr/local/nagios/etc/send_nsca.cfg";

# daemonize
Proc::Daemon::Init;

$SIG{'INT'} = sub { close(SN); exit;};


while (1) {
    &makeFifo($FIFO) unless (-p $FIFO);
    open(FIFO, "$FIFO") || die "cannot read from $FIFO: $!";
    while(<FIFO>) {
        #print STDERR "On line: $_";
        #open(SN,"|$SEND_NSCA"); # if eof(SN);
        #print SN;
        #close(SN);
        &forkSendNSCA($_);
    }
}


sub makeFifo {
    my ($FIFO) = @_;
    unless (-p $FIFO) {
        unlink $FIFO;
        system('mknod', $FIFO, 'p')
            && die "can't mknod $FIFO: $!";
    }
    chmod 0660, $FIFO;
    chown 5668, 5668, $FIFO;


}


sub forkSendNSCA {
    my ($t) = @_;

    $SIG{CHLD} = 'IGNORE';
    defined (my $kid = fork) or die "Cannot fork: $!\n";
    if ($kid) {
        #print STDERR "Parent $$ has finished, kid's PID: $kid\n";
    } else {
        chdir '/'                or die "Can't chdir to /: $!";
        open STDIN, '/dev/null'  or die "Can't read /dev/null: $!";
        open STDOUT, '>/dev/null'
            or die "Can't write to /dev/null: $!";
        open STDERR, '>/dev/null' or die "Can't write to /dev/null: $!";
        setsid or die "Can't start a new session: $!";

        select STDERR;
        local $| = 1;

        open(SN,"|$SEND_NSCA"); # if eof(SN);
        print SN "$t";
        close(SN);

        CORE::exit(0); # terminate the process
    }

}
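
For comparison, the core of the listener (read a line from the pipe,
hand it to a sender in the background, keep reading) can be sketched
in plain shell.  This is not a replacement for the Perl script: it
does not daemonize, it does not recreate the pipe, and it handles a
single open of the pipe rather than looping forever.  The function
name listen_and_forward is made up, and "cat >> file" stands in for
"send_nsca" so the demo runs without a central server.

```shell
#!/bin/sh
# Sketch of the forking listener idea in shell (hypothetical helper).

listen_and_forward() {
    # $1: FIFO path
    # $2: command that reads one result on stdin (stand-in for send_nsca)
    {
        while IFS= read -r line; do
            # One background sender per result, like the fork in the Perl script.
            printf '%s\n' "$line" | sh -c "$2" &
        done
        wait    # let all background senders finish
    } < "$1"
}

# Demo with made-up sample results and a temporary FIFO.
workdir=$(mktemp -d)
fifo="$workdir/send_through_nsca"
out="$workdir/forwarded.txt"
mkfifo "$fifo"
: > "$out"

listen_and_forward "$fifo" "cat >> '$out'" &
listener_pid=$!

# A single writer holds the FIFO open while it sends three results.
{
    printf 'host1\tPING\t0\t[nag1] PING OK\n'
    printf 'host2\tHTTP\t2\t[nag1] HTTP CRITICAL\n'
    printf 'host3\tSSH\t1\t[nag1] SSH WARNING\n'
} > "$fifo"

wait "$listener_pid"
forwarded_count=$(wc -l < "$out" | tr -d ' ')
echo "forwarded $forwarded_count results"
rm -rf "$workdir"
```

Because each result is sent by its own background process, the order
in which results arrive at the receiver is not guaranteed.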

