High cost for Nagios to spawn processes at Solaris 10 / Sun Fire T1000?

Thomas Guyot-Sionnest dermoth at aei.ca
Thu Oct 4 02:32:26 CEST 2007


On 29/09/07 04:16 PM, Steffen Poulsen wrote:
> Hi again,
> 
> Tonight one of our distributed servers took a break, and after it was up again, I noticed at our master server that everything seemed to have halted there.
> 
> The distributed server is responsible for reporting in ~3500 of our checks, and we have a check_freshness option for these at the master server.
> 
> So, what seems to have happened is that the freshness checks started to time out, and each one triggered a check_dummy that sets an UNKNOWN state. This particular server apparently can only do this at about 1 to 4 checks / second -> unusable platform until things stabilize again. .. Which will probably never happen, as it will spend too much time doing service alerts and too little time processing external commands (nsca passive check results) :-/
> 
> We are running with retention data enabled, so we figured the only way to get out of this situation was to delete the retention data, to allow for a "fresh" start from PENDING.
> 
> So, in short - _everything_ that involves a shell exit, is not parallelized, and hits more than a few checks at a time appears to break our setup on this platform as is?
> 
> I guess this T1000 platform is rather special, it is a "4 cores, 32 threads"-kind of thing - could it be that all of this parallelization has the exact opposite effect of what we were hoping for? Possibly a synchronization issue on the process spawn?
> 
> I guess our best option is to go x86 asap unless someone can enlighten us on this issue. If anybody else is running Nagios on similar hardware but without any issues, please speak up.

I'm trying to follow you there... So when your master server started to
run freshness checks, it wasn't able to process more than 4 checks/sec?
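
For a sense of scale, here is a back-of-the-envelope calculation using the
numbers from your mail (about 3500 stale checks, 1 to 4 results handled per
second). It is only a sketch: it assumes the rate stays constant and that
nothing else competes for Nagios' attention, and the function name is made
up for illustration. It works out to somewhere between roughly a quarter of
an hour and an hour before the backlog is gone.

```python
def drain_time_seconds(num_checks, checks_per_second):
    """Time to work through a backlog at a constant rate (an assumption)."""
    return num_checks / checks_per_second


# 3500 checks and 1-4 checks/sec are the figures quoted in the original mail.
for rate in (1, 4):
    secs = drain_time_seconds(3500, rate)
    print(f"{rate} checks/sec -> {secs:.0f} s (about {secs / 60:.0f} min)")
```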

You should look at all the bottlenecks. Apart from allowing unlimited
parallel checks, there are others:

- On every non-OK service result, a host check is run serially
(i.e. Nagios will wait on this host check before going on and processing
other service results). To overcome this problem I configured host check
commands with 1 sec timeouts, but this can still be a problem when many
hosts are down. This shouldn't be a problem if the hosts have active
checks disabled.

- Performance data processing commands and OC[HS]P commands are
serialized as well.

About getting an x86, you can probably run Linux on your Solaris server...

> Best regards,
> Steffen Poulsen
> 
> BTW: We solved the performance data issue by having Nagios write them to file as suggested and putting a simple perl tail at it:

A pipe works well too. See this to get some inspiration:
http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon
http://www.nagiosexchange.org/Misc.36.0.html?&tx_netnagext_pi1%5Bp_view%5D=972
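
To illustrate the pipe idea, here is a minimal sketch of a reader sitting on
a named pipe (FIFO). It is not taken from the pages above. The path, the
record layout and the function names are placeholders, and it assumes the
writer emits one record per line. In the demo a thread stands in for the
process that would write to the pipe.

```python
import os
import tempfile
import threading


def consume_fifo(path, handle_line):
    """Read newline-terminated records from a FIFO until the writer closes it."""
    count = 0
    # open() blocks here until some process opens the FIFO for writing.
    with open(path, "r") as fifo:
        for line in fifo:
            handle_line(line.rstrip("\n"))
            count += 1
    return count


def demo():
    """Self-contained demo; the writer thread is a stand-in, not Nagios."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "perfdata.fifo")  # hypothetical location
    os.mkfifo(path)

    def fake_writer():
        # Made-up tab-separated records, purely for illustration.
        with open(path, "w") as out:
            out.write("host1\tPING\trta=0.5ms\n")
            out.write("host2\tLOAD\tload1=0.10\n")

    writer = threading.Thread(target=fake_writer)
    writer.start()
    received = []
    consume_fifo(path, received.append)
    writer.join()
    os.remove(path)
    os.rmdir(workdir)
    return received


if __name__ == "__main__":
    for record in demo():
        print(record)
```

A long-running reader would reopen the pipe in a loop, since the read ends
every time the writer closes its side. The point of the design is that the
slow work happens in the reader, outside of Nagios.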

Good luck,

Thomas
