Nagios and Postgres

Marc Powell mpowell at ena.com
Tue Nov 26 17:33:05 CET 2002


Hello All,
 
Is anyone out there successfully using Nagios with a Postgres backend
for a large installation under RedHat (7.3)? I have been trying like
heck to get it to work nicely and I haven't had much luck. Here are the
numbers:
 
Central Server #1:
Nagios 1.0b6 using a Postgres 7.2.3 backend (thanks to the recent Postgres
timestamp patch); also tried 1.0b3 with pg 7.1 and pg 7.2.3.
RedHat 7.3 on a quad-processor PII 550 with 2.5 GB RAM and 200+ GB of
hardware RAID 5 disk space.
 
Central Server #2:
Nagios 1.0b3 using the flat text file backend.
RedHat 6.2 on a uniprocessor PIII 800 with 512 MB RAM and ~12 GB of
disk space (non-RAID).
 
Neither of the above servers is performing active checks.
 
I have 4 data collector machines, each polling 500 services (2000 total)
and reporting the results via NSCA to both of the servers above, with a
5-minute check interval for each service.
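
For reference, each collector submits its results roughly like this (a
simplified sketch -- the host/service names and config path are just
placeholders, not my real ones):

    # one result per line: <host><TAB><service><TAB><return code><TAB><plugin output>
    printf "webhost1\tHTTP\t0\tOK - 200 OK in 0.12 seconds\n" | \
        send_nsca -H central-server-1 -c /etc/nagios/send_nsca.cfg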
 
Here is what I am seeing on the machine with the postgres backend:
 
9:21am  up 22:08,  6 users,  load average: 53.54, 726.96, 844.05
14336 processes: 14334 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states: 44.5% user,  5.11% system,  0.0% nice, 49.7% idle
CPU1 states: 26.5% user, 10.5% system,  0.0% nice, 63.1% idle
CPU2 states: 14.3% user,  2.3% system,  0.0% nice, 83.5% idle
CPU3 states: 24.7% user, 34.10% system,  0.0% nice, 40.7% idle
Mem:  2582276K av, 2572180K used,   10096K free,       0K shrd,   66476K buff
Swap: 2097112K av,       4K used, 2097108K free                 1062052K cached
 
At this point the machine is of course only somewhat usable. The other
machine, accepting the exact same passive service checks, looks like
this:
 
9:25am  up 34 days, 20:19,  3 users,  load average: 1.94, 2.15, 2.07
149 processes: 147 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  0.0% user,  0.6% system,  0.0% nice,  0.2% idle
Mem:   516816K av,  265816K used,  251000K free,  230400K shrd,   25960K buff
Swap:  530104K av,    9840K used,  520264K free                  144908K cached
 
My major concern with the first machine is that all but about 80 of
those 14336 processes are nagios and nsca, with the majority being nsca.
That number continues to increase until I have to reboot the machine.
Using strace and lsof, it appears that the nsca processes are all
waiting to write to nagios.cmd, but the multiple nagios processes also
appear to be waiting on or writing to a pipe as well. I'm trying to
understand the overall process... I understand that nsca must wait until
the 4k pipe buffer is cleared before writing more data to it, but I don't
understand the 100 or so nagios processes that also appear to be spawned
by the master process. Does each of those also read commands from
nagios.cmd and then attempt to insert them into the DB?
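
(For context, my understanding is that each passive result ultimately lands
in nagios.cmd as a single external-command line like the one below; the
timestamp, host and service here are just made-up examples:

    [1038304800] PROCESS_SERVICE_CHECK_RESULT;webhost1;HTTP;0;OK - 200 OK in 0.12 seconds

so roughly 2000 of those lines hit the pipe every 5 minutes.)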
 
Here are the optimizations I have done on the central server:
 
Nagios - 
command_check_interval=-1
max_concurrent_checks=150   ; does this even apply?
service_reaper_frequency=2
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
execute_service_checks=0
accept_passive_service_checks=1
process_performance_data=0
aggregate_status_updates=1
status_update_interval=15
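
In the object config, the corresponding services are all passive-only; the
relevant directives in each service definition look roughly like this (just
the passive-related fragment, host/service names made up, rest of the
definition omitted):

    define service{
        host_name               webhost1
        service_description     HTTP
        active_checks_enabled   0   ; never run this check locally
        passive_checks_enabled  1   ; accept results submitted via nsca
        ...
        }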
 
Postgres -
tcpip_socket = true
max_connections = 100
port = 5432 
hostname_lookup = false
show_source_port = false
shared_buffers = 200        # 2*max_connections, min 16
max_fsm_relations = 100    # min 10, fsm is free space map
max_fsm_pages = 10000      # min 1000, fsm is free space map
max_locks_per_transaction = 64 # min 10
wal_buffers = 24            # min 4
sort_mem = 49152             # min 32
vacuum_mem = 49152          # min 1024
effective_cache_size = 30000  # default in 8k pages
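
My back-of-the-envelope math on those settings, in case I'm mis-sizing
something (assuming shared_buffers and effective_cache_size are counted in
8 kB pages, and sort_mem/vacuum_mem in kB):

    shared_buffers       =   200 pages * 8 kB =   1,600 kB  (~1.6 MB)
    effective_cache_size = 30000 pages * 8 kB = 240,000 kB  (~234 MB)
    sort_mem             = 49,152 kB  (~48 MB per sort)
    vacuum_mem           = 49,152 kB  (~48 MB)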
 
 
Linux - 
kernel.shmmax = 536870912
kernel.shmall = 536870912
ulimit -u 15000
ulimit -n 3000
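
For completeness, the kernel bits are applied like this at the moment (the
file and script locations are just where I happen to put them):

    # /etc/sysctl.conf, loaded at boot (or with `sysctl -p`)
    kernel.shmmax = 536870912
    kernel.shmall = 536870912

    # at the top of the nagios and nsca init scripts, before the daemons start
    ulimit -u 15000
    ulimit -n 3000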
 
 
Does anyone have any suggestions? I would appreciate any help I can get.
 
Thanks,
 
Marc