Optimising nagios

Jorgen Lundman lundman at gmo.jp
Thu Dec 9 04:45:52 CET 2004


Take two: I sent this from the wrong email address the first time. Moderators,
you can just ignore the first copy.


I do not know whether our Nagios setup counts as particularly large, but I
believe I am starting to see the effects of having too many hosts and service
checks. The next-check events seem to lag further and further behind, and pages
like "Status Summary" are very slow to load (although user responsiveness
matters much less to me than the monitoring itself). Re-submitting a check for
immediate execution can take 4-5 minutes to take effect.

Anyway, details are:

* Supermicro 6013, dual 2.4 GHz, Solaris 9, 1 GB memory.

Load average is generally between 2 and 3 (the graph shows it closer to 2 than
3, with no spikes), which seems ideal on a dual-CPU system.



+++++++++++++++++++++++++++++++++++++++++++++++++++++++



Statistics:

* Hosts
  46 Down, 0 Unreachable, 522 Up, 2 Pending

* Services
  62 Critical, 26 Warning, 1 Unknown, 3370 Ok, 0 Pending

Yesterday I changed the service check interval from 5 minutes to 10 minutes in
an attempt to quiet things down. I would rather check every 5 minutes, but if
that is too frequent, then so be it.

Currently we only use active checks, no passive checks at all. At a guess,
check_nrpe is the most used command; there are some Perl checks (run on the
local monitoring machine, I mean), but they should not be a majority. Perhaps I
should grep through the execution history to see which commands are executed
the most.
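One rough way to get that breakdown without an execution log is to count the
check_command lines in the service definitions. Since every service here uses
the same template and interval, the configured counts should roughly track
what is executed. The sketch below is in Python; the SAMPLE text and the
function name are made up for illustration, and the real definition files
would have to be read in instead.

```python
import re
from collections import Counter


def count_check_commands(config_text):
    """Count how often each plugin command appears in check_command lines."""
    counts = Counter()
    for line in config_text.splitlines():
        # Strip ';' comments, then look for a check_command directive.
        line = line.split(";", 1)[0].strip()
        match = re.match(r"check_command\s+(\S+)", line)
        if match:
            # Arguments follow the command name after '!',
            # e.g. check_nrpe!check_load -> check_nrpe
            counts[match.group(1).split("!", 1)[0]] += 1
    return counts


# Hypothetical sample; not taken from the real configuration.
SAMPLE = """
define service{
   use            generic-service
   check_command  check_nrpe!check_load
   }
define service{
   use            generic-service
   check_command  check_nrpe!check_disk   ; inline comment
   }
define service{
   use            generic-service
   check_command  check_ping!100.0,20%!500.0,60%
   }
"""

if __name__ == "__main__":
    for command, total in count_check_commands(SAMPLE).most_common():
        print(f"{total:5d}  {command}")
```

This only counts configured services. Retries after a failed check are not
included, so commands attached to flaky services will be under-represented.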

I have been reading the optimisation documentation, and it seems we are already
doing some (maybe even most) of the items suggested. I still have the embedded
Perl build option (--enable-embedded-perl) to try, if there is nothing obviously
wrong with our setup.

There are still some devices to be added, in particular, the network devices are
still not present.

Gaps have started to appear in the graphs, which could be due to checks being
delayed. Or it could be something unrelated.

I restarted it entirely today, just to clean things out, making sure it isn't
running twice etc.



How bad does it look?


Lund



+++++++++++++++++++++++++++++++++++++++++++++++++++++++


Nagios -s reports:

         SERVICE SCHEDULING INFORMATION
         -------------------------------
         Total services:             3459
         Total hosts:                570

         Command check interval:     -1 sec
         Check reaper interval:      4 sec

         Inter-check delay method:   SMART
         Average check interval:     600.867 sec
         Inter-check delay:          0.174 sec

         Interleave factor method:   SMART
         Average services per host:  6.068
         Service interleave factor:  7

         Initial service check scheduling info:
         --------------------------------------
         First scheduled check:      1102561317 -> Thu Dec  9 12:01:57 2004
         Last scheduled check:       1102561918 -> Thu Dec  9 12:11:58 2004

         Rough guidelines for max_concurrent_checks value:
         -------------------------------------------------
         Absolute minimum value:     24
         Recommend value:            72
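
The figures in this output can be reproduced from the totals, which also
shows the check rate the scheduler is aiming for. The sketch below is in
Python and assumes the "smart" formulas as I understand them from the
scheduling documentation: delay = average interval / total services,
interleave = ceil(services / hosts), minimum concurrency = ceil(max(reaper
interval, average execution time) / delay). The factor of three for the
recommended value is inferred from the 24 and 72 above, not taken from the
source code.

```python
import math


def smart_scheduling(total_services, total_hosts, avg_check_interval,
                     reaper_interval, avg_execution_time=0.0):
    """Reproduce the 'nagios -s' scheduling figures (all times in seconds)."""
    inter_check_delay = avg_check_interval / total_services
    interleave_factor = math.ceil(total_services / total_hosts)
    # Checks that can pile up between two reaper runs.
    minimum_concurrent = math.ceil(
        max(reaper_interval, avg_execution_time) / inter_check_delay)
    return {
        "inter_check_delay": inter_check_delay,
        "interleave_factor": interleave_factor,
        "minimum_concurrent": minimum_concurrent,
        # Assumption: recommended value is three times the minimum.
        "recommended_concurrent": 3 * minimum_concurrent,
        "checks_per_second": 1.0 / inter_check_delay,
    }


if __name__ == "__main__":
    # Values copied from the output above; 0.785 is the average execution
    # time reported by extinfo.cgi.
    for key, value in smart_scheduling(3459, 570, 600.867, 4, 0.785).items():
        print(f"{key:24s} {value:.3f}")
```

With these inputs the schedule calls for a little under six checks per
second, sustained.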



+++++++++++++++++++++++++++++++++++++++++++++++++++++++



Current configuration values are:

check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
comment_file=/usr/local/nagios/var/comment.log
downtime_file=/usr/local/nagios/var/downtime.log
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
log_rotation_method=m
log_archive_path=/usr/local/nagios/var/archives
use_syslog=0
log_notifications=0
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_service_checks=1
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=4
sleep_time=1
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=60
use_retained_program_state=0
interval_length=60
use_agressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
service_perfdata_command=service-perf-data-handler
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=1
freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0




+++++++++++++++++++++++++++++++++++++++++++++++++++++++




Typical template for hosts (actually, used by 100% of hosts):

   name                          generic-host
   notifications_enabled         1   ; Host notifications are enabled
   event_handler_enabled         1   ; Host event handler is enabled
   flap_detection_enabled        1   ; Flap detection is enabled
   process_perf_data             1   ; Process performance data
   retain_status_information     1   ; Retain status information across program restarts
   retain_nonstatus_information  1   ; Retain non-status information across program restarts
   max_check_attempts            10
   notification_interval         120
   notification_period           24x7
   notification_options          d,u,r




+++++++++++++++++++++++++++++++++++++++++++++++++++++++




Template for services (used by 100% of services):

   name                          generic-service ; The 'name' of this service template, referenced in other service definitions
   active_checks_enabled         1               ; Active service checks are enabled
   passive_checks_enabled        1               ; Passive service checks are enabled/accepted
   parallelize_check             1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
   obsess_over_service           1               ; We should obsess over this service (if necessary)
   check_freshness               0               ; Default is to NOT check service 'freshness'
   notifications_enabled         1               ; Service notifications are enabled
   event_handler_enabled         1               ; Service event handler is enabled
   flap_detection_enabled        1               ; Flap detection is enabled
   process_perf_data             1               ; Process performance data
   retain_status_information     1               ; Retain status information across program restarts
   retain_nonstatus_information  1               ; Retain non-status information across program restarts
   is_volatile                   0
   check_period                  24x7
   max_check_attempts            5
   normal_check_interval         10
   retry_check_interval          3


++++++++++++++++++++++++++++++++++++++++++++++++++++


extinfo.cgi output



Program-Wide Performance Information

Active Checks:

Time Frame              Checks Completed
<= 1 minute:            52 (1.5%)
<= 5 minutes:           52 (1.5%)
<= 15 minutes:          805 (23.3%)
<= 1 hour:              3458 (100.0%)
Since program start:    2443 (70.6%)

Metric                  Min.      Max.       Average
Check Execution Time:   < 1 sec   60 sec     0.785 sec
Check Latency:          < 1 sec   2097 sec   604.420 sec
Percent State Change:   0.00%     6.25%      0.00%

Passive Checks:

Time Frame              Checks Completed
<= 1 minute:            0 (0.0%)
<= 5 minutes:           0 (0.0%)
<= 15 minutes:          0 (0.0%)
<= 1 hour:              0 (0.0%)
Since program start:    0 (0.0%)
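
To put the active check numbers in context: with a 10-minute check interval,
every service should have been checked within any 15-minute window, and about
half of them within any 5-minute window. The Python sketch below compares
that expectation with the counts above. It assumes each "Time Frame" row
counts services whose most recent check fell inside the window, and it
ignores the restart earlier today, which may distort the short windows.

```python
def expected_fraction(window_sec, check_interval_sec):
    """Share of services expected to have a check inside the window."""
    return min(1.0, window_sec / check_interval_sec)


def compare(observed, total_services, check_interval_sec):
    """Return (window, expected %, observed %) rows, shortest window first."""
    rows = []
    for window_sec, completed in sorted(observed.items()):
        expected = 100.0 * expected_fraction(window_sec, check_interval_sec)
        actual = 100.0 * completed / total_services
        rows.append((window_sec, expected, actual))
    return rows


# Window length in seconds -> checks completed, copied from the output above.
OBSERVED = {60: 52, 300: 52, 900: 805, 3600: 3458}

if __name__ == "__main__":
    # 3459 services in total, normal_check_interval of 10 minutes.
    for window, expected, actual in compare(OBSERVED, 3459, 600):
        print(f"{window:5d} s  expected {expected:5.1f}%  observed {actual:5.1f}%")
```

Read this way, only about a quarter of the services were checked in the last
15 minutes where all of them should have been, which fits the average latency
of roughly one full check interval.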









-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)



