Message storms, service statuses, intervals and so on

Mohr James james.mohr at elaxy.com
Wed Oct 20 14:33:27 CEST 2004


Hi All!

I was testing Nagios's behavior when encountering a message storm. Using
send_nsca, I set a service to OK and then sent 10K critical messages.
The script that calls send_nsca passes the current loop count as part of
the message text, so I can see in nagios.log which message is currently
being processed. Before each loop I reset the service to OK. I did this
a couple of times, and on our machine Nagios seems to be able to process
about 40 messages a second.
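
For anyone who wants to reproduce the test, here is a minimal sketch of such a loop in Python. It is not my original script. It assumes send_nsca's default tab-delimited input (host, service, return code, plugin output) and its -H and -c options; the NSCA host name, the config path, and the function names are hypothetical placeholders.

```python
import subprocess

OK, CRITICAL = 0, 2  # standard plugin return codes


def nsca_line(host, service, code, text):
    """Build one passive result in send_nsca's default tab-delimited form."""
    return f"{host}\t{service}\t{code}\t{text}\n"


def storm_payload(host, service, count):
    """One OK reset followed by `count` CRITICAL results carrying the loop count."""
    lines = [nsca_line(host, service, OK, "reset before storm")]
    for i in range(1, count + 1):
        lines.append(nsca_line(host, service, CRITICAL, f"storm message {i}"))
    return lines


def send(lines, nsca_host, config="/etc/nagios/send_nsca.cfg"):
    """Pipe each result to its own send_nsca call (path and host are assumptions)."""
    for line in lines:
        subprocess.run(
            ["send_nsca", "-H", nsca_host, "-c", config],
            input=line, text=True, check=True,
        )


# Example (not run here): send(storm_payload("sol-sys-02", "http", 10000), "nagios-host")
```

Putting the loop count into the plugin output is what makes it possible to see in nagios.log which message is currently being processed.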

The service is defined like this:

define service {
        use                             generic-service         ; Name of service template to use

        host_name                       sol-sys-02
        service_description             http
        is_volatile                     0
        check_period                    24x7
        active_checks_enabled           0
        passive_checks_enabled          1
        max_check_attempts              1
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  vpo
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   check_http
}

We have a notification set up to create a trouble ticket in our help
desk tool. That mechanism works fine. 

There are a couple of things that have made me curious that I cannot
explain. To begin with, it is not the first event that creates the
trouble ticket. Once it was at count 150; another time it was at count 351.
So why isn't the first event triggering the trouble ticket? If I
understand it correctly, "max_check_attempts 1" says Nagios should
react the very first time. How come it then waits for the 351st event
before reacting? I could understand it if the notification is triggered
and the notification program then reads the current state (including the
current event with the current count). By the time the notification gets
around to reading the state info, the count has increased.
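
One way to check this would be to compare timestamps in nagios.log. The sketch below is only an illustration: it assumes log lines of the form "[epoch] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;host;service;code;output" and "[epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output", which depends on the logging options and the Nagios version. The function name and the sample contact and command names are hypothetical.

```python
import re

# Assumed line shape: "[epoch] TYPE: semicolon-separated fields"
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def summarize(log_lines, host, service):
    """Report the CRITICAL passive-result rate and the first notification seen."""
    critical_times = []
    first_notification = None
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, kind, rest = int(m.group(1)), m.group(2), m.group(3)
        if kind == "EXTERNAL COMMAND":
            f = rest.split(";", 4)  # command;host;service;code;output
            if (len(f) == 5 and f[0] == "PROCESS_SERVICE_CHECK_RESULT"
                    and f[1:3] == [host, service] and f[3] == "2"):
                critical_times.append(ts)
        elif kind == "SERVICE NOTIFICATION" and first_notification is None:
            f = rest.split(";", 5)  # contact;host;service;state;command;output
            if len(f) == 6 and f[1:3] == [host, service]:
                first_notification = (ts, f[5])
    span = critical_times[-1] - critical_times[0] if critical_times else 0
    rate = len(critical_times) / span if span > 0 else None
    return {
        "critical_count": len(critical_times),
        "rate_per_sec": rate,
        "first_notification": first_notification,
    }
```

If the output carried by the first notification shows a count well above 1 while the notification timestamp is close to that of the first CRITICAL result, that would support the idea that the state is read after the notification is triggered.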

The next is the "Status Information" field in the web interface. The
content is obviously changing, as I can see the count value increase
almost up to 10000. The status itself does not change, but the system
is always getting the "current" message. My problem is that I am not
sure what mechanism determines how long the system waits before
updating this field.

The section in the doc "Service Check Scheduling" talks about
rescheduling when a service is down, but I have not found anything (yet)
that says how often the "Status Information" is updated. It is not the
normal_check_interval because that is too long. However, the
retry_check_interval *could* be it. Still, the retry_check_interval
seems to imply that this is the time Nagios waits to retry to check the
service and not necessarily the time between updating the "Status
Information". Note that this is not an issue of updating the browser, as
I am continually pressing the refresh button. Instead, I see this as an
issue with Nagios.

Any help is greatly appreciated.

Regards,

Jim Mohr


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users