Separate mail server problems cause Nagios to plotz (or vice versa?)

Allan Clark allanc at chickenandporn.com
Fri Jun 24 21:10:55 CEST 2011


On Fri, Jun 24, 2011 at 11:53,  <up at 3.am> wrote:
>> Quoting up at 3.am:
>>
>>> We have Nagios monitoring a variety of services on roughly 50
>>> separate servers.  Several of them
>>> are mail servers, but only the "main" (that contains most of the
>>> Nagios notification recipients)
>>> one has this problem.
>>>
>>> The mail server will start to become unresponsive to just about any
>>> input (but pings fine).
>>
>> This is a mail server issue. You would need to determine exactly what
>> process(es) have become unresponsive and why.
>
> We're still trying to figure that out...but the question for this list
> is why Nagios would go nuts.

Do you have any staleness (freshness) checking configured on the tests
that go nuts?

Is it possible to make many of the sendmail tests (e.g. if you're
checking mqueue) depend on another test (such as "is it responding
on port 25?") so that when sendmail gets strange, at least many of
those tests are skipped?
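
A rough sketch of the dependency idea, assuming Nagios 3 object syntax.
The host name "mailhost" and the service descriptions "SMTP" and
"Mail Queue" are made up for illustration; substitute whatever your
config actually uses:

```
# Hypothetical names: mailhost, SMTP, Mail Queue
define servicedependency {
    host_name                       mailhost
    service_description             SMTP
    dependent_host_name             mailhost
    dependent_service_description   Mail Queue
    # skip checks of the dependent service while SMTP is in these states
    execution_failure_criteria      w,u,c
    # and suppress its notifications as well
    notification_failure_criteria   w,u,c
}
```

With something like that in place, a dead port 25 should produce one
alert rather than one per sendmail test.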


>>> Simultaneously, Nagios, which is on a separate server, will send out
>>> notifications that every
>>> service on every server is down because Nagios cannot reach them.
>>
>>
>> Why can't it reach them? Is your mail server also your router?
>
> Good Gosh, no!  That's why this is so puzzling.

Re: staleness above: can you watch your Nagios log, perhaps filtering
it through awk to add a readable timestamp to each entry, and just
leave that spooling on a terminal? Then, when things get strange and
Nagios goes nuts, you can see whether Nagios is at least running the
tests and getting responses.
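
If awk isn't handy, the same timestamp filter can be sketched in
Python. This assumes the usual log format where each entry starts with
"[epoch-seconds] "; the function name stamp() is just something I made
up, and times are printed in UTC:

```python
import re
import sys
from datetime import datetime, timezone

# Assumption: Nagios log entries start with "[<epoch seconds>] ".
EPOCH_PREFIX = re.compile(r"^\[(\d+)\] ")

def stamp(line: str) -> str:
    """Swap a leading [epoch] for a readable UTC time; pass other lines through."""
    m = EPOCH_PREFIX.match(line)
    if not m:
        return line
    when = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return "[" + when.strftime("%Y-%m-%d %H:%M:%S") + "] " + line[m.end():]

# Only act as a filter when something is piped in on stdin.
if __name__ == "__main__" and not sys.stdin.isatty():
    for raw in sys.stdin:
        sys.stdout.write(stamp(raw))
        sys.stdout.flush()  # keep output current when fed from tail -f
```

Pipe "tail -f" of your nagios.log into it and leave it running; if
check results keep scrolling past while the alerts fire, Nagios is at
least still executing tests.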

You mention LDAP; is your sendmail server also your LDAP server, and
is the Nagios host also using LDAP to resolve basic OS features like
UID?
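
One quick way to check that on the Nagios host is to time a UID lookup
while the mail server is misbehaving. A minimal sketch, assuming Python
is available there; time_uid_lookup() is a made-up helper, and a local
cache such as nscd could hide a slow backend:

```python
import os
import pwd
import time

def time_uid_lookup(uid: int) -> float:
    """Return the seconds taken to resolve uid through the system name service."""
    start = time.monotonic()
    try:
        pwd.getpwuid(uid)
    except KeyError:
        pass  # an unknown UID still exercises the lookup path
    return time.monotonic() - start

if __name__ == "__main__":
    uid = os.getuid()
    print(f"uid {uid} resolved in {time_uid_lookup(uid):.3f}s")
```

If that takes seconds rather than milliseconds during an incident, the
Nagios host is probably waiting on the same LDAP backend.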

Allan
-- 
allanc at chickenandporn.com  "金鱼" http://linkedin.com/in/goldfish

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null

