18000 services to check and Nagios just sits and waits.

Marc Powell marc at ena.com
Wed Dec 10 16:35:06 CET 2003


Congratulations. You have the largest single-host installation of nagios
that I have heard of. You haven't given any hard information about how
you have nagios configured, so we're going to have to make a lot of
guesses until you provide specific details. Your host and service
check definitions, as well as many options in nagios.cfg, can have a
profound effect on the performance of the program. Here are some
suggestions from my personal experience which may or may not be
redundant for you:

Nagios.cfg -
	- command_check_interval=-1 (may have no effect in your setup)
	- max_concurrent_checks=xxx (run '/path/to/nagios -s
/path/to/nagios.cfg' for a lower-bound estimate of this number;
increasing it will not hurt, up to a point)
	- service_reaper_frequency=2 (or 1 if you want; I'd start with
2)
	- use_aggressive_host_checking=0
	- aggregate_status_updates=1
	- status_update_interval=xxx (I suggest at least 60). This one
may actually be getting you. With 18000 services it's going to take
some time to update the status in the db, even if it is in RAM. Nagios
does a delete of the status tables, then an insert of the new
information. If you have the interval set at 30 seconds and the update
takes 29 seconds, that's practically all nagios will be doing; at best
it only has 30 seconds to process several hundred or thousand check
results.
	- inter_check_delay_method. When using the smart option, nagios
will try to spread out your checks so that they all fit in your average
check interval. If you don't have max_concurrent_checks set high enough,
or service_reaper_frequency set low enough, to allow this to happen,
the initial checks can get spread over a significant period of time.
(A sketch of how these settings might fit together follows this list.)
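
Pulled together, a nagios.cfg fragment along the lines of the above
might look something like this. The max_concurrent_checks number is
purely illustrative, not a recommendation; size it from the 'nagios -s'
output for your own installation:

	# nagios.cfg fragment - a rough sketch, not drop-in values
	command_check_interval=-1
	# illustrative only; run 'nagios -s' for a real lower bound
	max_concurrent_checks=500
	service_reaper_frequency=2
	use_aggressive_host_checking=0
	aggregate_status_updates=1
	# at least 60 with this many services
	status_update_interval=60
	# 's' = smart spreading of checks
	inter_check_delay_method=s

and the lower-bound estimate for max_concurrent_checks comes from:

	/path/to/nagios -s /path/to/nagios.cfg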

Host and service checks -
	- Use very simple host checks (single pings, for example, with no
retries) or disable host checks entirely. If any service returns a state
other than OK, nagios will aggressively check the status of the host and
stop doing everything else until max_check_attempts has been reached on
the host check. See the sketch after this list.
	- Use a sane check_interval. Don't expect nagios to be able to
complete 18,000 checks at 1-minute intervals.
	- Your custom plugin should be written in C, or if it's perl you
should use the embedded perl interpreter (ePN). Without ePN a perl
plugin can be very expensive, as a new copy of perl has to be launched
for every check.
	- If you're using parenting or service dependencies, these may
be problematic with large numbers of hosts/services (just guessing).
	- I'm not a programmer, but I don't believe that just because
linux understands hyperthreading a program will automatically take
advantage of it.
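
To illustrate the "very simple host check" point, here's a rough sketch
of what I mean. The command and host names are made up, check_ping is
the standard plugin, and I'm assuming you have a generic-host template
to inherit from; adjust the thresholds to taste:

	define command{
		command_name	check-host-ping-once
		command_line	$USER1$/check_ping -H $HOSTADDRESS$ -w 1000.0,80% -c 3000.0,100% -p 1
		}

	define host{
		use			generic-host	; assumed template
		host_name		somebox
		address			192.0.2.10
		check_command		check-host-ping-once
		max_check_attempts	1	; no retries on the host check
		}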

Ulimits - by default, Red Hat Linux 7.3 only allows a user 1024
open files, a stack size of 8192 kbytes and 7168 concurrent processes.
You may need to raise these once you get things going.
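
You can check what the nagios user is currently limited to and raise
the limits with bash's ulimit builtin, e.g. (the numbers here are just
examples; put whatever you settle on in the init script or in
/etc/security/limits.conf so it survives restarts):

	ulimit -a		# show all current limits
	ulimit -n 8192		# max open file descriptors
	ulimit -s 16384		# stack size in kbytes
	ulimit -u 14336		# max user processes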

Presuming that you don't make any significant changes based on the
suggestions above, is there anything in nagios.log that might indicate a
problem? Have you tried running strace on any of the nagios processes to
find out exactly what they are doing?
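
If you do strace it, something along these lines against the main
nagios process will show you where the time is going (substitute the
real PID; -f follows the forked check processes, -tt adds timestamps):

	strace -f -tt -p <nagios_pid> 2>&1 | less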

Finally, since you've created your own front end I presume you've
realized that nagios pre-2.0 has a hard time with large numbers of hosts
and services, particularly in the cgis. 2.0 will incorporate several
changes that reportedly make working with large numbers of hosts and
services better. YMMV though, and as far as I know the enhancements
mostly benefit the cgis.

--
Marc

> -----Original Message-----
> From: martin at idefix.net [mailto:martin at idefix.net]
> Sent: Wednesday, December 10, 2003 3:55 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] 18000 services to check and Nagios just sits
> and waits.
> 
> Hi all,
> 
> I'm trying to convince Nagios to perform very aggressively,
> but somehow it won't work.
> The documentation states everywhere that Nagios will consume
> all the CPU power you throw at it if you don't take care.
> Well, with me it doesn't, and I really want it to.
> 
> The situation:
> - All our machines send some email to the Nagios server which
>   we put in files and wrote a plugin to check those files.
> 
> - There are a lot of machines (almost 900) and we want to do a lot
>   of checks (18000).
> 
> - To make it worse, we forced Nagios to use MySQL for the service_status
>   and host_status data (as we created our own frontend and use MySQL as
>   the interface).
> 
> To make sure Nagios will be able to abuse the hardware as much as it can,
> we threw in a dual Xeon 3 GHz machine with 2GB memory and some 15k RPM
> SCSI disks. To make it better, Linux understands hyperthreading and
> makes it a total of 4 CPUs.
> To prevent MySQL from abusing the array controller too much, we made the
> service_status and host_status tables HEAP so they only use memory.
> 
> I would assume that Nagios would at least try to fork something like
> 40 to 100 processes and would consume at least one CPU but it doesn't.
> It won't abuse the memory either as there is about 1GB of memory left.
> 
> It only seems to be sitting there with 4 to 6 processes and allowing
> the latency to go up and up like there's no tomorrow. Or at least there
> won't be any checks tomorrow.
> 
> We've tried both the smart Nagios options and the dumb options, and
> even tried to think for ourselves and calculate the right config values,
> but nothing seems to work.







