More passive problems

Dan Rich drich at employees.org
Sun May 11 04:59:49 CEST 2003


Jason Lancaster said:
> Can you give us a better description of your environment? Are you only
> running one Nagios server or do you have a central server that this
> server is sending statistics to? What external command file frequencies,
> interleaving, and aggregate methods are you using in your Nagios.cfg
> file?

Sure.  I have one Nagios server running two instances of Nagios (I partitioned
the farm monitoring off to a separate instance because it was just too much
information on a single web page with both the 600+ farm systems and the rest
of our servers).

The server itself is a dual-processor system (that looks like a quad
processor to the OS), has 4GB of memory, a 100Mb network connection (soon to
be 1Gb most likely), and also serves as our cricket and syslog server.  I
turned off the passive checks earlier today (the scripts still run and update
my cricket server, they just don't pump any data into Nagios), and have still
seen a few load spikes.  However, nothing as bad as what I experienced last
night.

The farm monitor has 757 hosts and 2225 services.  1478 of those services are
passive, updated via two scripts.  At the moment, the farm instance has 404
processes running with a load of 1.43, all forks of the master nagios process
as far as I can tell.
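
For reference, here is a minimal sketch of how a script like this might hand
passive results to Nagios.  It is not the actual farm script: the function
names, host names, and service names are hypothetical.  The only part taken
from Nagios itself is the PROCESS_SERVICE_CHECK_RESULT external command line
format.  The sketch opens the command file once per run and writes every
result in one go, rather than once per service.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line.

    Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    timestamp = int(now if now is not None else time.time())
    # A newline ends the command, so keep the plugin output on one line.
    clean_output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean_output}\n"
    )


def submit_batch(results, command_file):
    """Write all results with a single open() of the command file.

    `results` is an iterable of (host, service, return_code, output) tuples.
    Returns the number of command lines written.
    """
    lines = [format_passive_result(*result) for result in results]
    with open(command_file, "w") as pipe:
        pipe.write("".join(lines))
    return len(lines)
```

With the configuration below, command_file would be
/var/nagios/var/rw/nagios-farm.cmd.  Opening a named pipe for writing blocks
until a reader has it open, so this should only be run while Nagios is up.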

Here is my nagios.cfg file, less the cfg_file lines and comments:
log_file=/var/nagios/var/nagios-farm.log
status_file=/var/nagios/var/status-farm.log
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_check_interval=-1
command_file=/var/nagios/var/rw/nagios-farm.cmd
comment_file=/var/nagios/var/comment-farm.log
downtime_file=/var/nagios/var/downtime-farm.log
lock_file=/var/nagios/var/nagios-farm.lock
temp_file=/var/nagios/var/nagios-farm.tmp
log_rotation_method=d
log_archive_path=/var/nagios/var/archives-farm
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_service_checks=1
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/nagios/var/status-farm.sav
retention_update_interval=60
use_retained_program_state=0
interval_length=60
use_agressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=1
freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
admin_email=drich
admin_pager=pagenagios
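
To make the passive-check settings easier to pick out of a dump like the one
above, here is a small sketch that parses the key=value lines and pulls out
the directives that touch external commands and passive results.  The
function names and the list of keys to look at are my own selection, not
anything Nagios defines.

```python
# Directives assumed to be the interesting ones for passive checks;
# this list is an illustrative choice, not an official grouping.
PASSIVE_KEYS = (
    "check_external_commands",
    "command_check_interval",
    "command_file",
    "accept_passive_service_checks",
    "log_passive_service_checks",
    "service_reaper_frequency",
    "aggregate_status_updates",
    "status_update_interval",
)


def parse_main_config(text):
    """Parse key=value lines, skipping blanks, comments, and cfg_file lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only; values may themselves contain "=".
        key, separator, value = line.partition("=")
        if not separator:
            continue
        key = key.strip()
        if key == "cfg_file":
            continue  # may appear many times; not a single-valued setting
        settings[key] = value.strip()
    return settings


def passive_settings(settings):
    """Return only the directives listed in PASSIVE_KEYS that are present."""
    return {key: settings[key] for key in PASSIVE_KEYS if key in settings}
```

For example, passive_settings(parse_main_config(open(path).read())) on the
file above would include command_check_interval=-1 and
service_reaper_frequency=10.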

> -----Original Message-----
> From: nagios-devel-admin at lists.sourceforge.net
> [mailto:nagios-devel-admin at lists.sourceforge.net] On Behalf Of Dan Rich
> Sent: Saturday, May 10, 2003 15:20
> To: nagios-users at lists.sourceforge.net;
> nagios-devel at lists.sourceforge.net
> Subject: [Nagios-devel] More passive problems
>
>
> I am concerned with the way Nagios appears to handle passive alerts.  As I
> mentioned before, I am using a script to monitor a system farm of several
> hundred machines.  Every five minutes this script submits passive checks for
> each machine into Nagios.
>
> Doing the above I frequently see many (for large values of many,
> sometimes > 100) Nagios processes that are blocked on a lock file in the var
> directory.  It looks like this is due to the process that is reading the
> passive checks from the named pipe.  However, this has frequently led to
> system loads over 100, and this morning brought the system to a grinding
> halt.
>
> Does anyone have any idea why the passive checks are causing this problem?
> If I stop the cron job that generates the checks and restart Nagios the load
> goes away and doesn't return.  My whole point in doing this in the first
> place with passive checks was to avoid the load on the system caused by
> hundreds of processes having to run every few minutes, but that seems to
> have backfired.
>


-- 
Dan Rich <drich at employees.org> |   http://www.employees.org/~drich/
                               |  "Step up to red alert!"  "Are you sure, sir?
                               |   It means changing the bulb in the sign..."
                               |          - Red Dwarf (BBC)