oscp command design and FIFO locking?

Fred f1216 at yahoo.com
Mon Sep 12 13:53:53 CEST 2005


Andreas,

Thank you for the comments.  This mail thread is getting visually ugly
because of the word wraps, so I'm not going to comment inline.  A few
more questions/clarifications follow:

SC_OPEN_MAX is probably what I am hitting; it is configured to be 1024
on our system.  This could explain quite a bit.
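
Out of curiosity I will double-check what the box actually reports.  A
minimal C sketch (nothing nagios-specific, just the sysconf(3) calls):

    /* fdlimit.c - print the per-process limits this box reports.
     * Build: cc -o fdlimit fdlimit.c */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("_SC_OPEN_MAX  = %ld\n", sysconf(_SC_OPEN_MAX));
        printf("_SC_CHILD_MAX = %ld\n", sysconf(_SC_CHILD_MAX));
        return 0;
    }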

To be clear, the issues that I originally described about writing to the
nagios.cmd FIFO were not related in any way to the plugins launched directly
by nagios; I have no doubt that nagios does the right thing internally to
ensure consistency.  What nagios has no control over are the asynchronous
processes that write to the nagios.cmd FIFO to provide passive check input
to nagios, i.e.,

echo "a bunch of lines of passive-check-results ... " >>nagios.cmd

while nagios is running (especially if nagios is also writing its own
active check results there!) could cause lots of trouble if there are no
observed locks.  The above is essentially what happens in my system, where
the echo is really a set of perl scripts that all take turns writing to the
FIFO.
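
Roughly what I mean by the scripts taking turns is the pattern below.
This is just a C sketch of the idea rather than the actual perl, and the
lock file path, FIFO path, and sample command line are made up for
illustration.  The point is that every writer takes the same exclusive
advisory lock and submits its whole batch in a single write:

    /* submit_passive.c - serialize writers of the nagios command FIFO.
     * Paths and the sample command are examples only; every writer must
     * agree on the same lock file for the advisory lock to mean
     * anything. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    #define CMD_FIFO "/usr/local/nagios/var/rw/nagios.cmd"
    #define CMD_LOCK "/usr/local/nagios/var/rw/nagios.cmd.lock"

    static int submit_batch(const char *buf)
    {
        int rc = -1;
        int lockfd = open(CMD_LOCK, O_CREAT | O_RDWR, 0644);
        if (lockfd < 0)
            return -1;

        flock(lockfd, LOCK_EX);             /* take our turn               */

        int fd = open(CMD_FIFO, O_WRONLY);  /* blocks until nagios reads   */
        if (fd >= 0) {
            write(fd, buf, strlen(buf));    /* the whole batch in one call */
            close(fd);
            rc = 0;
        }

        flock(lockfd, LOCK_UN);             /* let the next writer go      */
        close(lockfd);
        return rc;
    }

    int main(void)
    {
        /* example passive result; the real plug-ins format real data */
        return submit_batch("[1126526033] PROCESS_SERVICE_CHECK_RESULT;"
                            "node001;syslog;0;nothing matched\n");
    }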

This was the reason behind my question #2 about FIFO corruption.

Again, thank you for the SC_OPEN_MAX pointer ... I think what caused my
problems may have been a recent addition of host checks; these open more
descriptors than were previously used and may have pushed things over the
edge.


-FredC





--- Andreas Ericsson <ae at op5.se> wrote:

> Fred wrote:
> >>>This causes the design of the submit command to need to throttle the
> >>>access to whatever resources it might need to touch.  If using the
> >>>default send_nsca command, there can now be multiple (and many
> >>>multiple) send_nsca's kicked off and each of these on the target
> >>>server will all be attempting to write to the nagios FIFO.  The nagios
> >>>FIFO can get horribly overloaded.  If the nagios master daemon is not
> >>>aggressively reading the FIFO (check_command_interval=-1) then the
> >>>daemons can stack up and eventually consume socket resources and
> >>
> >>I handle approximately 3300 passive checks every 5 minutes on somewhat
> >>commodity hardware (quad PIII 800) using NSCA with no problems.  I
> >>anticipate that I can double and possibly triple that number as the
> >>FIFO is empty approximately 1/3 of the time.  Are you doing
> >>significantly more passive checks than that?
> >>
> > 
> > 
> > Most likely ... on one installation I have over 1040 nodes and over
> > 10,500 checks, 99% of which are passive and involve plug-ins which
> > write to the nagios.cmd FIFO.  Each compute node defines 10 passive
> > service check definitions; each service node defines an additional 10
> > active checks.
> > 
> > The nsca daemon forks children to write to nagios.cmd as a result of a
> > send_nsca connection request.  If at the same time some plug-in tries
> > to write to this file, there is a good chance that the buffers can be
> > interspersed if both the nsca process and the plug-in do not observe
> > any kind of lock mechanism.  This can also occur when nagios forks off
> > multiple service check plug-ins that each want to write to the FIFO.
> > It took a system configuration of about 120 or so nodes for this to
> > start happening for me.  It wasn't consistent and it isn't fatal.  If
> > you looked closely, nagios would log an invalid command, read the next
> > line of the FIFO, and move on; however, the data from that line would
> > be lost.  Since implementing a lock around writing to the FIFO from all
> > my plug-ins, this has not occurred.  Note that in my smaller
> > configurations I don't use nsca, as there is no distributed monitoring.
> > The contention in these smaller systems is between concurrently running
> > plug-ins.
> > 
> 
> If you read the code you'll notice that the active checks also write
> their service results to the FIFO. This is a showstopper on the road to
> "scale like hell", so a few other methods are being tested. Multiplexing
> several children from a single parent seems the way to go. 509 checks
> can run smoothly at once on a modern system (roughly 1017 if you don't
> let the child have an stderr). The limit is set by
> sysconf(_SC_OPEN_MAX) / 2 or sysconf(_SC_CHILD_MAX), whichever is lower.
> 
> > 
> >>>memory etc.  As far as I can tell, nsca doesn't lock the FIFO, which
> >>>also means that writes will get intermixed with writes from plug-ins
> >>>that might be running on the master system.  (I have seen this over
> >>>and over)
> >>
> >>I don't see how. Local active checks, at least the standard plugins,
> >>don't use nagios.cmd in any way.
> 
> 
> This is incorrect. See above.
> 
> 
> >>This would also be contrary to the blocking behavior you comment on
> >>above, where your OS is essentially 'locking' the FIFO until it has
> >>been cleared. As far as your OS is concerned, there is no distinction
> >>between NSCA trying to write to the pipe and some other process doing
> >>the same. While others are more versed in this than I am, it is my
> >>understanding that if the program is trying to write more data to the
> >>pipe than it can currently hold it will be prevented from doing so by
> >>the OS, that only one process can write to the FIFO at a time, and that
> >>all writes are atomic. This presumes that the plugin output is < the
> >>max FIFO length supported by your OS.
> > 
> 
> Actually, the write(2) call will write some data, but not necessarily
> all of it. The smallest guaranteed atomic write size is 512 bytes on
> POSIX systems. Obviously, this is larger on most, but it can't be
> infinite, so not all writes are atomic.
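
(Breaking my own no-inline-comments rule for a second: the 512-byte floor
described here is the POSIX PIPE_BUF guarantee, so a single write() that
stays at or under PIPE_BUF will not be interleaved with anyone else's.  A
quick sketch of how to ask the FIFO for its actual limit; the path is
just an example:)

    /* pipebuf.c - ask the command FIFO for its atomic write limit. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* open read-only and non-blocking just to get a descriptor;
         * nothing is ever read from it */
        int fd = open("/usr/local/nagios/var/rw/nagios.cmd",
                      O_RDONLY | O_NONBLOCK);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        printf("PIPE_BUF for this FIFO: %ld bytes\n",
               fpathconf(fd, _PC_PIPE_BUF));
        close(fd);
        return 0;
    }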
> 
> 
> > 
> > I use few local active checks.  Those that I do use are typically
> > kicked off to generate per-node data that is written to the nagios.cmd
> > FIFO, one line item for each node.  With the FIFO on a 4k block
> > filesystem, that isn't too much room before it fills.  At about 80-120
> > chars per message, it only takes 30-50 messages to fill the FIFO, and
> > then the plug-in is blocked waiting for nagios to read it.  If nagios
> > only reads it every 15 seconds, it could easily take over a minute to
> > read 128 messages (128 nodes).
> 
> 
> So set service_result_reaper_frequency (or some such) to 2. Having it at 
> 15 in a large environment just won't work.
> 
> 
> > More than one process can write to a FIFO at a time; it is just a unix
> > file opened for append.  The OS doesn't control this, the user
> > application has to.  It gets worse ... if nagios spins off more than
> > one plug-in that in turn writes to the FIFO, and each of those wants to
> > write say 128 lines of data, they can easily toast each other.  Nagios
> > does have a setting to keep the number of concurrent processes to 1,
> > but that seems to be too big a hammer for this problem.  In any case,
> > locking between plug-ins (and wrapping any existing ones with locks)
> > works well.  I also set my nagios daemon to aggressively read from the
> > FIFO, otherwise things start timing out (with a service check timeout
> > at say 60-120 seconds).
> > 
> > While I have few local checks, they are the core of my monitoring
> > system, as they are responsible for filling in all the per-node
> > information for the majority of the passive checks.  For example, I
> > have a syslog monitor plugin that parses the recent syslog messages,
> > compares them against interesting patterns, and then formats and writes
> > a line to the FIFO for each node that has something interesting.  For
> > those nodes that do not have any interesting content, it formats a line
> > that says nothing matched (if I didn't do that, the service check would
> > never fill in any data, or it would go stale).  Other plug-ins report
> > per-node statistics and format this into the FIFO.  Each node has
> > passive check definitions for these results.
> > 
> > 
> >>>To avoid this, I have had to implement serious locking in all
> >>>plug-ins and not use nsca as it has no locking mechanism (that I know
> >>>of).
> >>
> 
> A better solution would have been to implement a local UDP socket 
> mechanism. The reaper in Nagios can easily multiplex, and the receive 
> buffers on sockets can be dynamically increased from the program 
> creating it (up to at least 65536 bytes even on very old linuxes).
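
(Again inline, sorry.  Something along these lines is how I picture the
UDP idea; nothing like this exists in nagios today, and the port and
buffer size are only example values.  Since each datagram is delivered
whole, the interleaving problem goes away without any locking:)

    /* udp_reaper.c - rough sketch of a local UDP result receiver. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int rcvbuf = 65536;                 /* enlarge the receive buffer */
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(5668);            /* example port */
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* local only   */
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        char msg[8192];
        for (;;) {
            /* each datagram arrives whole, so senders never interleave */
            ssize_t len = recv(sock, msg, sizeof(msg) - 1, 0);
            if (len <= 0)
                continue;
            msg[len] = '\0';
            fputs(msg, stdout);    /* a real reaper would parse it */
        }
    }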
> 
> 
> >>I'm curious about how you've done this. What exactly are you locking?
> >>How is it helping? NSCA shouldn't need locking as it depends on your OS
> >>to control access to the FIFO.
> >> 
> >>
> >>>Right now I am fighting with the oscp commands that can launch dozens
> >>>of copies at a time, and each of these (in my case) writes to a local
> >>>file that will eventually be pushed up to the master and written
> >>>(while locking) to the nagios FIFO.
> >>>
> >>>So ... I guess my questions are:
> >>>
> >>>1) Should nagios be forking off more than one oscp command at a time?
> >>
> >>Yes, one per check.
> >>
> >>
> >>>2) Has anyone else run into FIFO corruption because of the lack of
> >>>   advisory locking in all the plug-ins?
> >>
> 
> This is quite a misplaced question. The plugins just write to a file
> descriptor they think is stdout, but which is really a pipe opened by
> nagios (using the pipe(2) syscall) specifically for that plugin. That
> pipe doesn't get filled, as only one plugin is writing to it. It's
> nagios itself that writes to its own FIFO.
> 
> 
> >>Not here in almost 4 years of using Nagios/Netsaint.
> > 
> > 
> > Again, thanks for the input.  
> > 
> > 
> >>--
> >>Marc
> >>
> >>
> 
> -- 
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Lead Developer






