negative check latency with Nagios as VM?

Steve Shipway s.shipway at auckland.ac.nz
Mon Aug 20 07:23:50 CEST 2007


We run a lot of VMWare here, although we're running our Nagios on a
physical box for performance reasons.  I've spent a lot of time
researching how to monitor virtual hosts and the potential pitfalls...

> We're testing our Nagios 2.9 implementation on a VMWare server.  This
> box does have the VMWare tools installed and is running NTP to sync
> time.

Linux under VMWare seems to work best if you let VMWare Tools synch the
time to the ESX server (which uses NTP to synch its own time).  If you
also run NTP inside the guest, it can get confused because vmware-tools
is adjusting the clock as well.  Similarly, a Windows guest should rely
on vmware-tools alone for its clock synch.

> The performance on this box seems a bit worse, but roughly comparable to
> our physical box.  (Oddly enough, Nagios restart almost instantaneously
> on the VM where it takes around 20 seconds to respond to the web
> interface on the physical box...)

If your old box was Nagios 1.x then that's the reason.  Nagios 2 is
much, much faster in the web interface because it preparses and caches
the configuration. Another possibility could be that your virtual disk
is held partly in memory cache on the ESX server, speeding up initial
access.

> at one point I saw the minimum check time at -2.00 seconds.  This means
> this VM is so fast that it's running checks before they're even
> scheduled!  Wow!

This is because your clock is getting skewed.  VMWare is not good for
anything that is sensitive to timing at a resolution finer than about
one minute, because the clock hops about a bit due to the
virtualisation.  Running ntp *and* vmware-tools together makes it worse:
they fight over who is authoritative, which causes weird behaviour.
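
To make the negative figure concrete, here is a minimal sketch of how a
backwards clock step can produce it.  This is a simplified model and not
Nagios's own code: it assumes latency is just the observed start time
minus the scheduled time, and the function name and the numbers are
made up for illustration.

```python
def check_latency(scheduled_time: float, observed_start: float) -> float:
    """Latency as the scheduler sees it: observed start minus scheduled time."""
    return observed_start - scheduled_time


# A check is scheduled for t = 1000.0 s and really starts 0.5 s late.
scheduled = 1000.0
true_start = 1000.5

# Hypothetical: the guest clock is stepped back 2.5 s just before the
# check starts, so the start time read from the clock is too early.
clock_step = -2.5
observed_start = true_start + clock_step

latency = check_latency(scheduled, observed_start)
print(f"{latency:.2f}")  # prints -2.00
```

The check did not run early; the clock it was timed against moved.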

> In any case, I was concerned about this.  My biggest worry with a VM is
> that it doesn't track the time well enough.

This is very much the case: a guest OS under VMWare will experience
weird clock behaviour.  This is why plugins like check_net, check_cpu,
and anything else rate-based are pointless and actually misleading if
run via NRPE in a VM, since they divide by an interval measured on the
guest's own clock.  A plugin which queries SNMP to get a counter and
then calculates its own rate on a different (physical) server is fine,
as long as the rate calculation is not run in a VM.
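
As a rough sketch of that last approach, the rate can be worked out from
two counter samples using timestamps taken on the physical poller.  The
function and parameter names below are hypothetical, the SNMP fetch
itself is left out, and a 32-bit wrapping counter is assumed.

```python
def counter_rate(prev_value: int, prev_time: float,
                 curr_value: int, curr_time: float,
                 counter_bits: int = 32) -> float:
    """Per-second rate between two samples of a wrapping counter.

    The timestamps are assumed to come from the clock of the machine
    doing the polling (a physical box), not from the guest.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        # A zero or negative interval means the clock cannot be trusted.
        raise ValueError("non-positive interval between samples")
    # Modulo arithmetic handles a single counter wrap between samples.
    delta = (curr_value - prev_value) % (2 ** counter_bits)
    return delta / elapsed


# 600000 bytes counted over a true 60 s interval.
good = counter_rate(0, 0.0, 600000, 60.0)

# Same counter values, but a skewed clock saw only 50 s go by.
skewed = counter_rate(0, 0.0, 600000, 50.0)

print(good, skewed)  # prints 10000.0 12000.0
```

The counter values are identical in both cases; only the measured
interval differs, and that alone inflates the reported rate by 20%.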

> Or perhaps I'm just associating this with a VM and it's just Nagios
> itself.  Has anyone seen this before?

I've seen it before in checks run under VMWare.  If you want to check CPU
usage under VMWare, I'm working with some people at Bright House
Networks on the new version of check_esx to support ESX3.  The old
version works with ESX2.

In brief -
* Don't run NTPD and vmware-tools together
* Don't run check_cpu, check_net or check_memory for a guest
* Don't run any rate-based checks on a virtual machine
* Don't run Nagios under VMWare if you can avoid it

Hope this helps,

Steve
