negative check latency with Nagios as VM?

Frost, Mark {PBG} mark.frost1 at pepsi.com
Mon Aug 20 17:39:16 CEST 2007


Steve,

Thanks for your comments.

We're running the VM as a copy of our current Nagios instance which is
also 2.9.  So most likely, it's that the disk is being cached as you
indicated.

At this point, because it sounds like we have a choice, I think we may
push to keep Nagios on a physical box.

We're going to break our current architecture out and finally start
using a distributed model (possibly with some failover as well).  I was
considering having some of the distributed servers be virtual.  If the
"master" (i.e. the server that the distributed boxes report to) were
physical but one or more of the distributed servers were VMs, would that
pose a serious problem?  We were also talking about making the failover
server for the master a VM.  In the event of a disaster that would
require a failover, I suppose we could deal with that server's
idiosyncrasies for the duration of that outage.

Thanks

Mark

-----Original Message-----
From: Steve Shipway [mailto:s.shipway at auckland.ac.nz] 
Sent: Monday, August 20, 2007 1:24 AM
To: Frost, Mark {PBG}
Cc: nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] negative check latency with Nagios as VM?

We run a lot of VMWare here, although we're running our Nagios on a
physical box for performance reasons.  I've spent a lot of time
researching how to monitor virtual hosts and the potential pitfalls...

> We're testing our Nagios 2.9 implementation on a VMWare server.  This
> box does have the VMWare tools installed and is running NTP to sync
> time.

Linux under VMWare seems to work best if you let VMWare Tools synch the
time to the ESX server (which uses NTP to synch its own time).  If you
run NTP on a virtual host, it can sometimes get confused as vmware-tools
will also adjust the time.  Similarly, a Windows guest should rely on
vmware-tools for its clock synch, not anything else.

> The performance on this box seems a bit worse, but roughly comparable
> to our physical box.  (Oddly enough, Nagios restarts almost
> instantaneously on the VM where it takes around 20 seconds to respond
> to the web interface on the physical box...)

If your old box was Nagios 1.x then that's the reason.  Nagios 2 is
much, much faster in the web interface because it preparses and caches
the configuration. Another possibility could be that your virtual disk
is held partly in memory cache on the ESX server, speeding up initial
access.

> at one point I saw the minimum check time at -2.00 seconds.  This
> means this VM is so fast that it's running checks before they're even
> scheduled!  Wow!

This is because your clock is getting skewed.  VMWare is not good for
anything that is time-sensitive at a resolution finer than 1 minute,
because the clock hops about a bit due to the virtualisation.
Particularly when you're running ntp *and* vmware-tools it can cause
weird behaviour as they fight over who is authoritative.
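
To make the negative figure concrete, here is a minimal sketch in Python.
It assumes latency is simply the check's actual start time minus its
scheduled time, both read from the guest's wall clock.  This is a
simplified model, not Nagios's own code, and all names and numbers are
made up for illustration.

```python
def check_latency(scheduled_at: float, started_at: float) -> float:
    """Latency = wall-clock start time minus wall-clock scheduled time."""
    return started_at - scheduled_at


# Hypothetical wall-clock readings, in seconds.
scheduled_at = 1000.0   # the check was due at t=1000 by the guest clock
true_delay = 0.5        # the scheduler really started it 0.5 s late
clock_step = -2.5       # meanwhile the guest clock was stepped back 2.5 s

# The start timestamp is read from the clock *after* it was stepped back.
started_at = scheduled_at + true_delay + clock_step

latency = check_latency(scheduled_at, started_at)
print(f"reported latency: {latency:.2f} s")
```

The check really ran half a second late, but because the clock moved
backwards in between, the reported latency comes out negative.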

> In any case, I was concerned about this.  My biggest worry with a VM
> is that it doesn't track the time well enough.

This is very much the case: a guest OS under VMWare will experience
weird clock behaviour.  This is why plugins like check_net, check_cpu,
and anything rate-based are pointless and actually misleading if run via
NRPE in a VM.  A plugin which queries SNMP to get a counter and then
calculates its own rate on a different (physical) server is fine, as
long as the rate calculation is not run in a VM.
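
A rough sketch in Python of why the rate goes wrong.  It assumes a
rate-based check takes two counter samples and divides the counter delta
by the elapsed time measured on the local clock.  The function name and
the numbers are hypothetical and not taken from any particular plugin.

```python
def rate_per_second(count_then: int, count_now: int, elapsed_seconds: float) -> float:
    """Rate = counter delta divided by the elapsed time the caller measured."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive to compute a rate")
    return (count_now - count_then) / elapsed_seconds


# Hypothetical byte counter samples, really taken 60 seconds apart.
count_then, count_now = 10_000, 16_000

# A physical server measures the interval correctly.
physical_rate = rate_per_second(count_then, count_now, 60.0)

# A guest whose clock lost 12 seconds believes only 48 seconds passed.
guest_rate = rate_per_second(count_then, count_now, 48.0)

print(f"physical: {physical_rate:.1f} B/s, guest: {guest_rate:.1f} B/s")
```

The counter delta is identical in both cases; only the measured interval
differs, so the guest reports a rate 25% too high.  If the clock stepped
backwards far enough, the measured interval would be zero or negative and
no meaningful rate could be computed at all.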

> Or perhaps I'm just associating this with a VM and it's just Nagios
> itself.  Has anyone seen this before?

I've seen it before in checks run under VMWare.  If you want to check CPU
usage under VMWare, I'm working with some people at Bright House
Networks on the new version of check_esx to support ESX3.  The old
version works with ESX2.

In brief -
* Don't run NTPD and vmware-tools together
* Don't run check_cpu, check_net or check_memory for a guest
* Don't run any rate-based checks on a virtual machine
* Don't run Nagios under VMWare if you can avoid it

Hope this helps,

Steve
