More passive problems

Jason Lancaster jlancaster at affinity.com
Sun May 11 08:12:28 CEST 2003


Dan, I've been dealing with a very similar problem in a large-scale
distributed environment for the past week. Initially, I was
bottlenecked by the ocsp command running on my "child" monitoring
servers. I was able to resolve that. (See my thread titled "distributed
monitoring/central server performance problems".)

I have yet to resolve my central server's performance issues when it
handles more than about 1200 external commands. I figured this was a
CPU issue and was planning to upgrade my central server to a faster
system on Monday. The current central system is a dual 1.0 GHz. I'm
assuming you're running a pair of P4s with hyper-threading, given
your comment about the quad processor.

While I may not have a quick answer for you, I'd like to at least
confirm we're both having the same issue. Mine lies with the processing
time it takes for Nagios to execute a service status check update. If my
terminology seems off, I'm talking about where Nagios forks an
additional Nagios child and waits for an update to be written to
status.log. So many external commands are written to Nagios per second
that the Nagios children queue up until the system load reaches a
breaking point. This can be fixed by limiting the number of Nagios
processes, but that delays my results even further, eventually creating
an ever-growing queue of results the system will never finish processing.
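
The queueing argument above comes down to simple arithmetic: if results
arrive faster than they are processed, the backlog grows without bound,
and capping the number of Nagios processes only lowers the processing
rate. A minimal Python sketch of that arithmetic follows. The rates are
hypothetical placeholders, not measurements from either server, and
backlog_after is an illustrative helper, not part of Nagios.

```python
def backlog_after(seconds, arrivals_per_sec, processed_per_sec, start=0.0):
    """Return the number of unprocessed results after `seconds`.

    The backlog never drops below zero: an idle server cannot bank
    spare capacity for later.
    """
    backlog = start
    for _ in range(seconds):
        backlog = max(0.0, backlog + arrivals_per_sec - processed_per_sec)
    return backlog


# Hypothetical rates, chosen only to show the two regimes.
# Arrivals exceed processing: the backlog grows by 1 result every second.
print(backlog_after(600, arrivals_per_sec=5.0, processed_per_sec=4.0))
# Processing exceeds arrivals: the backlog stays empty.
print(backlog_after(600, arrivals_per_sec=4.0, processed_per_sec=5.0))
```

If the first case matches the central server, a faster CPU helps only
if it raises the processing rate above the arrival rate.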

I'm still hoping the CPUs are the bottleneck in my environment.

-Jason
-----Original Message-----
From: Dan Rich [mailto:drich at employees.org] 
Sent: Saturday, May 10, 2003 23:00
To: Jason Lancaster
Cc: drich at employees.org; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] More passive problems


Jason Lancaster said:
> Can you give us a better description of your environment? Are you only
> running one Nagios server or do you have a central server that this
> server is sending statistics to? What external command file frequencies,
> interleaving, and aggregate methods are you using in your nagios.cfg
> file?

Sure.  I have one Nagios server running two instances of Nagios (I
partitioned the farm monitoring off to a separate server because it was
just too much information on a single web page with both the 600+ farm
systems and the rest of our servers).

The server itself is a dual-processor system (that looks like a quad
processor to the OS), has 4GB of memory, a 100Mb network connection
(soon to be 1Gb most likely), and also serves as our cricket and syslog
server.  I turned off the passive checks earlier today (the scripts
still run and update my cricket server, they just don't pump any data
into Nagios), and have still seen a few load spikes.  However, nothing
as bad as what I experienced last night.

The farm monitor has 757 hosts and 2225 services.  1478 of those
services are passive, updated via two scripts.  At the moment, the farm
instance has 404 processes running with a load of 1.43, all forks of the
master nagios process as far as I can tell.
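
For reference, a passive service result reaches Nagios as one
PROCESS_SERVICE_CHECK_RESULT line written to the external command file
(command_file in nagios.cfg). The Python sketch below shows one way a
script could format such lines and write a whole batch with a single
open of the pipe. It is not the two scripts mentioned above; the helper
names, host name, and service name are hypothetical.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # A newline ends the command, so flatten any newlines in the output.
    output = output.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit_batch(results, command_file):
    """Write every result in `results` with a single open of the file.

    `command_file` is expected to be the Nagios named pipe; opening it
    blocks until Nagios has the other end open for reading.
    """
    lines = [passive_result_line(*result) for result in results]
    with open(command_file, "w") as pipe:
        pipe.write("".join(lines))
    return len(lines)


# Hypothetical host and service names, with a fixed timestamp.
print(passive_result_line("farm001", "load", 0, "OK - load 0.42",
                          now=1052600000), end="")
```

At 1478 passive services every five minutes, that averages roughly five
results per second, although a cron-driven script delivers them in
bursts rather than evenly.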

Here is my nagios.cfg file, less the cfg_file lines and comments:
log_file=/var/nagios/var/nagios-farm.log
status_file=/var/nagios/var/status-farm.log
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_check_interval=-1
command_file=/var/nagios/var/rw/nagios-farm.cmd
comment_file=/var/nagios/var/comment-farm.log
downtime_file=/var/nagios/var/downtime-farm.log
lock_file=/var/nagios/var/nagios-farm.lock
temp_file=/var/nagios/var/nagios-farm.tmp
log_rotation_method=d
log_archive_path=/var/nagios/var/archives-farm
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_service_checks=1
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/nagios/var/status-farm.sav
retention_update_interval=60
use_retained_program_state=0
interval_length=60
use_agressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=1
freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
admin_email=drich
admin_pager=pagenagios
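
A few of these directives bear directly on the passive-check load
discussed in this thread. According to the Nagios documentation,
command_check_interval=-1 asks Nagios to check the command file as often
as possible, service_reaper_frequency=10 processes finished check
results every 10 seconds, and max_concurrent_checks=0 places no limit on
concurrent checks. The Python sketch below is one hedged way to pull
such settings out of a nagios.cfg for comparison between servers.
parse_nagios_cfg is a hypothetical helper, and it keeps only the last
value of repeated directives such as cfg_file.

```python
def parse_nagios_cfg(text):
    """Parse `name=value` lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only: values such as
        # illegal_object_name_chars themselves contain '='.
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


# A few lines copied from the listing above.
sample = """\
check_external_commands=1
command_check_interval=-1
max_concurrent_checks=0
service_reaper_frequency=10
aggregate_status_updates=1
status_update_interval=15
"""

cfg = parse_nagios_cfg(sample)
for name in ("command_check_interval", "max_concurrent_checks",
             "service_reaper_frequency", "status_update_interval"):
    print(name, "=", cfg[name])
```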

> -----Original Message-----
> From: nagios-devel-admin at lists.sourceforge.net
> [mailto:nagios-devel-admin at lists.sourceforge.net] On Behalf Of Dan Rich
> Sent: Saturday, May 10, 2003 15:20
> To: nagios-users at lists.sourceforge.net;
> nagios-devel at lists.sourceforge.net
> Subject: [Nagios-devel] More passive problems
>
>
> I am concerned with the way Nagios appears to handle passive alerts.  As
> I mentioned before, I am using a script to monitor a system farm of
> several hundred machines.  Every five minutes this script submits passive
> checks for each machine into Nagios.
>
> Doing the above I frequently see many (for large values of many,
> sometimes more than 100) Nagios processes that are blocked on a lock
> file in the var directory.  It looks like this is due to the process
> that is reading the passive checks from the named pipe.  However, this
> has frequently led to system loads over 100, and this morning brought
> the system to a grinding halt.
>
> Does anyone have any idea why the passive checks are causing this
> problem?  If I stop the cron job that generates the checks and restart
> Nagios the load goes away and doesn't return.  My whole point in doing
> this in the first place with passive checks was to avoid the load on the
> system caused by hundreds of processes having to run every few minutes,
> but that seems to have backfired.
>
> --
> Dan Rich <drich at employees.org> |   http://www.employees.org/~drich/
>                                |  "Step up to red alert!"  "Are you sure, sir?
>                                |   It means changing the bulb in the sign..."
>                                |          - Red Dwarf (BBC)
>
>
>
> -------------------------------------------------------
> Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara
> The only event dedicated to issues related to Linux enterprise solutions
> www.enterpriselinuxforum.com
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>
>
>


-- 
Dan Rich <drich at employees.org> |   http://www.employees.org/~drich/
                               |  "Step up to red alert!"  "Are you sure, sir?
                               |   It means changing the bulb in the sign..."
                               |          - Red Dwarf (BBC)


