Reports only show data from a specific time period?

Paul L. Allen pla at softflare.com
Fri Feb 6 20:52:14 CET 2004
BOLLENGIER Eric writes: 

> Wouldn't it be better to have it set to "UNDETERMINATE" outside of the
> check_period ?

I assumed trends and availability would do something sensible outside
the check period (I've never had to use them against something that
wasn't 24x7).  Looking closely at the FAQ and docs, the state is
undetermined outside the check period *but* the CGIs use the last known
state outside of check periods, so it's working as documented (but not as
expected). There's no explicit statement on the CGIs page about this,
so you have to do some digging to find out that's how it works. 

After some thought, I concluded that this is not ideal for host/service
checks, but better than showing undetermined.  If you use the CGIs at
all, you probably want to know that there was a failure at the end of the
check period which may persist until the host/service is next needed.
If they show undetermined you would not see from the CGI that there had
been a fault which may still be present and will need to be dealt with.
If they show undetermined then the first you'll know, if all you look
at is the CGI, is the next morning when the host/service is needed (and
you may get an angry phone call first). 

So there's possibly an argument for having an orange "Outside of check
period but last known status was critical" state indicating that the
host/service is not needed at this time but may have a fault.  But most
people will get mail and/or pager alerts anyway and I'd guess most
monitoring is 24x7, so it's hard to justify the effort involved in adding
an extra state.  Yeah, it would mean my boss, who keeps an eye on the
status displays, would see an orange instead of a red and know it wasn't
something he had to panic about, but I can live with him having occasional
panics as long as he checks whether the machine is inside or outside the
check period before demanding it be fixed immediately or demanding to know
why it hasn't been fixed yet.  And my guess is it wouldn't be simple to
implement, either, so I doubt the developers would be eager to add it. 

The case for availability and trends is not as simple.  If you
have an outage 5 minutes before the check period ends but the host or
service returns before the next check period starts then availability is
going to show a 16-hour, 5-minute outage in a 24-hour period whereas
what you would like is for it to show a 5 minute outage in an 8-hour
monitored period and to ignore the unmonitored period completely. 
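
A rough sketch of that arithmetic, in Python.  The 09:00-17:00 check
period and the dates are made up for illustration; nothing here reads
real Nagios data.

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (zero if disjoint)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))

# Hypothetical day: checks run 09:00-17:00, the fault is seen at 16:55,
# and the next check (which finds everything OK) is at 09:00 next day.
period_start = datetime(2004, 2, 5, 9, 0)
period_end = datetime(2004, 2, 5, 17, 0)
fault_seen = datetime(2004, 2, 5, 16, 55)
next_ok = datetime(2004, 2, 6, 9, 0)

# Last known state carried forward until the next result arrives.
carried = next_ok - fault_seen
# Only the part of the outage that falls inside the check period.
monitored = overlap(fault_seen, next_ok, period_start, period_end)

print(carried, monitored)
```

The first figure is the 16-hour, 5-minute outage; the second is the 5
minutes you would like to see.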

I don't think showing 16 hours of undetermined (as you ask for) is the
best solution because there can be other causes for the state to be
undetermined, so availability needs to distinguish between undetermined
(Nagios tried to check but couldn't get an answer) and unmonitored
(Nagios wasn't trying to check at that time).  Or it could simply ignore
the unmonitored period completely.  But I'd want either of those to
be display options if implemented at all, not the only way of displaying
it.  I want to be able to see a pessimistic display (as at present) not an
optimistic display.  I want to assume the worst and be proved wrong rather
than assume the best and get angry phone calls.  Seeing a large chunk of
critical rather than a brief period of critical followed by a chunk of
undetermined or unmonitored makes it clear to me that the problem could
well have persisted all night rather than leading me to assume it was just
a brief problem. 
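
To make the two display options concrete, here is a minimal sketch of
a per-state total with a switch between them.  The event list is a
simplified stand-in of my own, not the format Nagios stores, and the
function name, the "UNMONITORED" label and the 09:00-17:00 windows are
all assumptions for illustration.

```python
from datetime import datetime, timedelta

ZERO = timedelta(0)

def state_totals(events, end, windows, pessimistic=True):
    """Total time per state from the first event up to `end`.

    events  -- time-sorted (datetime, state) tuples (hypothetical format)
    windows -- (start, end) tuples, one per check period
    pessimistic=True carries the last known state across unmonitored
    time; False books that time under "UNMONITORED" instead.
    """
    totals = {}
    following = events[1:] + [(end, None)]
    for (start, state), (stop, _) in zip(events, following):
        span = stop - start
        if pessimistic:
            inside = span
        else:
            # Time in this segment that falls inside any check period.
            inside = sum(
                (max(min(stop, w_end) - max(start, w_start), ZERO)
                 for w_start, w_end in windows),
                ZERO,
            )
        totals[state] = totals.get(state, ZERO) + inside
        if span > inside:
            totals["UNMONITORED"] = (
                totals.get("UNMONITORED", ZERO) + (span - inside)
            )
    return totals

events = [
    (datetime(2004, 2, 5, 9, 0), "OK"),
    (datetime(2004, 2, 5, 16, 55), "CRITICAL"),
    (datetime(2004, 2, 6, 9, 0), "OK"),
]
windows = [
    (datetime(2004, 2, 5, 9, 0), datetime(2004, 2, 5, 17, 0)),
    (datetime(2004, 2, 6, 9, 0), datetime(2004, 2, 6, 17, 0)),
]
end = datetime(2004, 2, 6, 17, 0)

worst_case = state_totals(events, end, windows, pessimistic=True)
clipped = state_totals(events, end, windows, pessimistic=False)
```

With pessimistic=True the overnight gap stays under CRITICAL; with
False only the 5 minutes inside the check period do, and the other 16
hours are booked separately.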

The current behaviour of the CGIs suits the majority of our requirements
because we have almost no monitored items that are turned off overnight,
and the few that are belong to clients who aren't interested in the
availability data because they want us to take care of it all for them
without them having to do anything themselves.  That may not be true of
your situation, but I suspect most checking done with Nagios is 24x7 even
if alerts are restricted by the check period.  So although the
availability and trends are not ideal for all usage, they probably suit
most people.  Unless the changes required are trivial, I think it's
probably going to be hard to justify the effort involved to the
developers given that there are more important things that could be (and
are being) improved. 

I would guess the easiest solution for those that need it is to
take the raw data and process it with something else, or to have a script
which modifies the data by inserting "undetermined" at the end of each
check period on things which aren't monitored 24x7 (you'd need to
be careful about file locking, or copy the files elsewhere, modify the
copies, and have another copy of the CGIs that works on the files in the
new location).  Or, rather than having a cron job restart Nagios every
hour, just schedule a restart at the end of the check period (unless you
have a lot of different check periods involved).  I don't think those are
ideal, because I'd prefer to see a distinction between undetermined and
unmonitored, but it's not me that wants those features. 
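
The "insert undetermined at the end of each check period" bodge might
look roughly like this.  It works on an in-memory list of
(timestamp, state) tuples, which is my own simplification; parsing the
real files and writing the modified copies back out is left out, and
the function name is hypothetical.

```python
from datetime import datetime

def add_period_end_markers(events, period_ends, marker="UNDETERMINED"):
    """Return a new, time-sorted list with a marker at each period end.

    events      -- (datetime, state) tuples, a stand-in for the real data
    period_ends -- datetimes at which a check period closes
    The input list is not modified, so this is safe to run on a copy of
    the data while the originals stay untouched.
    """
    merged = list(events) + [(when, marker) for when in period_ends]
    # Sort on the timestamp only; sorted() is stable, so a real event
    # sharing a timestamp with a marker stays ahead of the marker.
    return sorted(merged, key=lambda item: item[0])

events = [
    (datetime(2004, 2, 5, 9, 0), "OK"),
    (datetime(2004, 2, 5, 16, 55), "CRITICAL"),
    (datetime(2004, 2, 6, 9, 0), "OK"),
]
period_ends = [datetime(2004, 2, 5, 17, 0)]

patched = add_period_end_markers(events, period_ends)
for when, state in patched:
    print(when.isoformat(), state)
```

Anything that reads the patched list then sees the critical state end
at 17:00 instead of running on until the next morning.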

Or, if you really want those features without bodging, code them up and
submit a patch.  I can see that Softflare might have need of something
like that in the future, so it was worth taking the time (which became
free because I had to wait for one of our client's offices to empty so I
could do some potentially disruptive work outside their working hours) to
think about options and make suggestions in case we ever do need them.  But
until we do actually need them, I'm not going to do any playing with it
because maybe, by then, somebody else will have added something like that. 

Without looking at the code, my first thoughts about how you'd build
that sort of feature into Nagios itself make it appear to be hard.  One
way would be to rewrite the CGIs to be capable of getting the check periods
from the configuration files and making appropriate adjustments to
what they display.  Another way would be to modify the Nagios scheduler so
that when inside the check period it performs a check every X minutes and
when outside the check period it instead writes "unmonitored" to the status
file every X minutes.  Either way looks like it could be difficult to
implement, so if I needed this sort of thing right now I'd go for one of
the bodges and live with the fact that the results are not my preferred
solution. 



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users