Message storms, service statuses, intervals and so on

Andreas Ericsson ae at op5.se
Wed Oct 20 21:59:55 CEST 2004


Mohr James wrote:
> Hi All!
> 
> I was testing Nagios's behavior when encountering a message storm. Using
> send_nsca, I set a service to OK and then sent 10K critical messages.
> The script that calls send_nsca passes the current loop count as part of
> the message text, so I can see in nagios.log which message is currently
> being processed. Before each loop I reset the service to OK. I did this
> a couple of times and on our machine, nagios seems to be able to process
> about 40 messages a second.
> 

That seems pretty consistent with some experiments I've done with named 
pipes. If you empty the pipe as fast as you can, you can squeeze through 
about 200kB/second (PIII 833), but at the risk of data loss. CPU speed 
doesn't seem to matter all that much. Throughput drops to 180kB/second 
on a PII 300, but only rises to 210kB/second on a P4 3.0 with HT enabled.

Nagios has other things on its mind, so it doesn't empty the pipe as 
often as a completely mindless message-eater can.
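
To make the "mindless message-eater" idea concrete, here is a minimal 
Python sketch of a reader that does nothing but drain a named pipe while 
a writer thread fills it. It is not the program used for the figures 
above; the function name, the FIFO path and the default sizes are all 
made up for illustration. The writer blocks when the pipe is full, so 
this sketch measures throughput only and does not reproduce data loss.

```python
import os
import tempfile
import threading
import time


def measure_fifo_throughput(total_bytes=1_000_000, chunk_size=4096):
    """Push total_bytes through a named pipe; return (bytes_read, seconds)."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "test.fifo")  # hypothetical path
    os.mkfifo(fifo_path)

    def writer():
        # Opening for write blocks until the reader opens the other end.
        fd = os.open(fifo_path, os.O_WRONLY)
        try:
            payload = b"x" * chunk_size
            sent = 0
            while sent < total_bytes:
                # os.write returns the number of bytes actually written.
                sent += os.write(fd, payload[: total_bytes - sent])
        finally:
            os.close(fd)  # closing gives the reader its end-of-file

    thread = threading.Thread(target=writer)
    thread.start()

    received = 0
    fd = os.open(fifo_path, os.O_RDONLY)  # blocks until the writer opens
    start = time.perf_counter()
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:  # empty read: the writer has closed the pipe
                break
            received += len(data)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start

    thread.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return received, elapsed


if __name__ == "__main__":
    n, secs = measure_fifo_throughput()
    print(f"read {n} bytes in {secs:.3f} seconds")
```

A reader that has other work to do between reads, as Nagios does, will 
see a lower rate than a loop like this one.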

The throughput can be increased fairly simply, though, by patching the 
kernel to enlarge the buffer for named pipes. It's hard to predict how 
that will affect other programs, so it's probably best not to.

> The service is defined like this:
> 
> define service {
>         use                             generic-service         ; Name of service template to use
> 
>         host_name                       sol-sys-02
>         service_description             http
>         is_volatile                     0
>         check_period                    24x7
>         active_checks_enabled           0
>         passive_checks_enabled          1
>         max_check_attempts              1
>         normal_check_interval           5
>         retry_check_interval            1
>         contact_groups                  vpo
>         notification_interval           120
>         notification_period             24x7
>         notification_options            w,u,c,r
>         check_command                   check_http
> }
> 
> We have a notification set up to create a trouble ticket in our help
> desk tool. That mechanism works fine. 
> 
> There are a couple of things that have made me curious that I cannot
> explain. To begin with, it is not the first event that creates the
> trouble ticket. Once it was at count 150; another time it was at count 351.
> So why isn't the first event triggering the trouble ticket? If I
> understand it correctly,  "max_check_attempts 1" says Nagios should
> react the very first time. How come it then waits for the 351st event
> before reacting? I could understand it if the notification is triggered
> and then the notification program reads the current state (including the
> current event with the current count). By the time the notification gets
> around to reading the state info the count has increased. 
> 
> The next is the "Status Information" field in the web browser. The
> content is obviously changing as I can see that the count value
> increases almost up to 10000. The status does not change, but the
> system is always getting the "current" message. My problem is I am not
> sure as to what mechanism is used to determine how long the system would
> wait before updating. 
> 
> The section in the doc "Service Check Scheduling" talks about
> rescheduling when a service is down, but I have not found anything (yet)
> that says how often the "Status Information" is updated. It is not the
> normal_check_interval because that is too long. However, the
> retry_check_interval *could* be it. Still, the retry_check_interval
> seems to imply that this is the time Nagios waits to retry to check the
> service and not necessarily the time between updating the "Status
> Information". Note that this is not an issue of updating the browser, as
> I am continually pressing the refresh button. Instead, I see this as an
> issue with Nagios.
> 
> Any help is greatly appreciated.
> 
> Regards,
> 
> Jim Mohr
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
> Use IT products in your business? Tell us what you think of them. Give us
> Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
> http://productguide.itmanagersjournal.com/guidepromo.tmpl
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 
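
For anyone who wants to reproduce the test described in the quoted 
message, here is a rough Python sketch that builds the input for one 
storm run: a single OK result followed by a number of CRITICAL results 
with the loop count in the message text. It relies on send_nsca reading 
tab-separated lines (host, service description, return code, plugin 
output) from standard input. The function name and the message wording 
are made up for illustration, and the original script may well have 
called send_nsca once per message rather than feeding it one stream.

```python
import sys


def build_storm_lines(host, service, count):
    """Return send_nsca input: one OK reset, then `count` CRITICAL results."""
    ok, critical = 0, 2  # standard Nagios plugin return codes
    # The first line resets the service to OK before the storm starts.
    lines = [f"{host}\t{service}\t{ok}\tReset to OK\n"]
    for i in range(1, count + 1):
        # The loop count goes into the plugin output so it shows in nagios.log.
        lines.append(f"{host}\t{service}\t{critical}\tStorm message {i}\n")
    return lines


if __name__ == "__main__":
    # Host and service names are taken from the service definition above.
    sys.stdout.writelines(build_storm_lines("sol-sys-02", "http", 10000))
```

The output would be piped into send_nsca with whatever host and config 
file options fit the local setup.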

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer






