new plugin interface for Nagios

Deomid Ryabkov rojer at rbc.ru
Fri May 7 15:33:25 CEST 2004


AE> Deomid Ryabkov wrote:
>> greetings, fellow Nagios users.
>> 
>> well, basically I think it's just about time to add a new plugin interaction interface to Nagios.
>> pretty bold, ha? ;)
>> now let me explain. it has been almost a year since we turned to Nagios for our monitoring needs
>> (we were previously using BigBrother and oh my dear, was it awful! ;))
>> so we are almost happy now. however, as the configuration continues to grow, the response time
>> of the whole monitoring system increases.
>>
>> currently we have 248 hosts monitored with 755 active checks at a 60 seconds interval.
>> (interval_length=10, normal_check_interval 6)
>> 
AE> I think this is where some of your problems start. Running ALL checks 
AE> with a 60 second interval is hardly useful. You should look into 
AE> implementing different templates for them (we have critical-service (1 
AE> minute interval), default-service (5 minute interval), 
AE> noncritical-service (30 minutes interval)). This allows for excellent 
AE> scalability.
well, this is not the point of discussion. you can always check less often
or, just as you say, adjust check intervals by the criticality of the related services.
all this i clearly understand, but... you know, it's always nice to be up-to-date on this.
i mean the health of the network. and of course when my load average gets somewhere near 10,
i guess i'll have to lengthen the check interval or whatever, BUT (and that's the point
I'm trying to make): there's room for improvement.
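
for a sense of scale, the figures quoted earlier in this message can be turned into a rough rate. this is only back-of-the-envelope arithmetic, assuming the checks are spread evenly over the interval:

```python
# Figures quoted earlier in this message.
interval_length = 10          # seconds per Nagios time unit
normal_check_interval = 6     # time units between two runs of a check
active_checks = 755

# Each check runs once per (interval_length * normal_check_interval) seconds.
seconds_between_checks = interval_length * normal_check_interval

# Assumption: checks are evenly spread, so this is an average, not a peak.
checks_per_second = active_checks / seconds_between_checks

print(f"each check runs every {seconds_between_checks} s")
print(f"average rate: {checks_per_second:.1f} checks/s, "
      f"each one needing at least one new process")
```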

>> being in charge of the monitoring, by now i have done all i could to optimize plugins,
>> and in fact this has helped a lot to keep the system running at a decent pace.
>> (for example, i have integrated disk checks into one plugin that uses shared snmplib
>> instead of calling snmpget, effectively eliminating another fork)

AE> snmpget only loads heavily if it needs to parse the mibs. Use '-m: ' to 
AE> load NO mibs with snmpget. This will make it a whole lot faster.

that is not the problem anymore: only a few of my checks use it.

>> so the biggest problem at this time seems to be Nagios's need to launch a process for every check.
>> 
AE> That problem will still exist, unless you mean to make the code 
AE> thread-safe, which would make nagios a memory-hog on large systems (a 
AE> lot more hash buckets would be required for this to work). Besides, on 
AE> linux-systems, fork() uses copy-on-write, so only the PTE needs be created.

well, right now it takes fork() + exec() to complete a check, and what i am aiming at is that latter exec().
dropping it doesn't make nagios threaded.
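
a minimal sketch of the "fork() but no exec()" idea, in Python rather than Nagios's C and for Unix only. check_dummy and run_check_forked are made-up names for illustration, not anything Nagios provides; the 0/1/2/3 exit codes follow the usual plugin convention:

```python
import os

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_dummy(used_percent, warn=80, crit=90):
    """Hypothetical in-process check: compare a usage figure to thresholds."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK

def run_check_forked(check, *args):
    """Run check(*args) in a forked child and return its exit status.

    The child calls the function directly and never calls exec().
    """
    pid = os.fork()
    if pid == 0:
        # Child: run the check and report the result through the exit code.
        try:
            status = check(*args)
        except Exception:
            status = UNKNOWN
        os._exit(status)
    # Parent: reap the child and decode its exit status.
    _, raw = os.waitpid(pid, 0)
    if os.WIFEXITED(raw):
        return os.WEXITSTATUS(raw)
    return UNKNOWN

if __name__ == "__main__":
    print(run_check_forked(check_dummy, 95))
```

the parent still gets isolation from a crashing check, but the cost of loading and starting a separate program is gone.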

>> so now i'm thinking of adding some kind of plugin invocation mechanism into Nagios
>> that wouldn't require starting up another program.
>> and what i am thinking of as my options are:
>> 
>> 1) shared library mechanism, like Apache modules. should be the fastest of all, but has its shortcomings.
>> not very flexible.

AE> Not a bad idea, but nagios would still have to fork() or 
AE> pthread_create() to actually RUN the different checks (unless you want 
AE> it to serialize checks, which is just plain dumb).

basically, i don't mind nagios forking (yet), but instead of running an external plugin it should...
well, that is to be decided ;)

>> 2) some kind of IPC. this would involve, i think, some check daemon process that'd start with nagios
>> and respond to check requests from it. a pipe or message queue could be used for communication.

AE> Now we're talking. See comments below.

>> 3) just forget about it.
>> 
AE> Not necessarily a bad thing.
of course.

>> i think i'll do that one way or another. but i want to make it The Right Way (r) and this is
>> where i turn to you and ask if you have any ideas/opinions/suggestions and in general, if it's worth
>> implementing at all...
>> 
AE> I'd say that nagios should be split in two.
...
[snip]

okay, that idea sounds a bit too radical for me. what i suggested in fact implies less of a change,
but it seems to me that it'd be a good optimization.

let me tell you a bit more about what led me to this idea of changing plugin interface.
a real-world example, our current configuration.
for almost every one of our hosts we run check_disk_snmp - this is a plugin that gets a summary of
current disk usage via snmp and, given threshold values, returns the appropriate exit code.
as of now, for every check a separate process is launched. arguments are parsed, snmp session
is created and initialized, host's filesystems are enumerated, their current state is recorded,
warning threshold value is obtained (for unix hosts).
then the fs data is matched against the thresholds, with the most severe condition becoming the exit code.
summary is printed and there we go, check done.
and we do this for more than 200 hosts, every minute (we are leaving the check interval out of our discussion for now).
to me, it seems obvious that this could be optimized, if only we didn't have to start all over every time.
most of the data is the same all the time, so why not just cache it?
i could write a check_disk_snmpd, that'd create and initialize an snmp session, cache filesystem data
and thresholds and only do a couple of get()'s upon a request arriving from nagios to freshen the data.
seems pretty obvious to me indeed.
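
a rough sketch of the caching part of such a daemon. everything here is hypothetical: DiskCheckCache, the max_age parameter and the two callables stand in for real SNMP queries, which are faked in the demo so the example runs on its own:

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2

class DiskCheckCache:
    """Per-host cache of the data that rarely changes between checks."""

    def __init__(self, enumerate_fs, get_usage, max_age=3600):
        # enumerate_fs(host) -> {fs_name: (warn_pct, crit_pct)}  (expensive)
        # get_usage(host, fs_name) -> used percent               (cheap)
        # Both callables are placeholders for real SNMP queries.
        self.enumerate_fs = enumerate_fs
        self.get_usage = get_usage
        self.max_age = max_age
        self._cache = {}  # host -> (timestamp, thresholds)

    def _thresholds(self, host):
        entry = self._cache.get(host)
        if entry is None or time.monotonic() - entry[0] > self.max_age:
            # Cache miss or stale entry: redo the expensive enumeration.
            entry = (time.monotonic(), self.enumerate_fs(host))
            self._cache[host] = entry
        return entry[1]

    def check(self, host):
        """Return (status, summary); the most severe condition wins."""
        worst, parts = OK, []
        for fs, (warn, crit) in self._thresholds(host).items():
            used = self.get_usage(host, fs)  # only this is fetched every time
            status = CRITICAL if used >= crit else WARNING if used >= warn else OK
            worst = max(worst, status)
            parts.append(f"{fs}={used}%")
        return worst, " ".join(parts)

# Demo with fake data instead of SNMP.
calls = {"enumerate": 0}

def fake_enumerate(host):
    calls["enumerate"] += 1
    return {"/": (80, 90), "/var": (80, 90)}

def fake_usage(host, fs):
    return {"/": 42, "/var": 85}[fs]

cache = DiskCheckCache(fake_enumerate, fake_usage)
first = cache.check("host1")
second = cache.check("host1")  # reuses the cached filesystem list
print(first, calls["enumerate"])
```

the second call only fetches current usage; the filesystem list and thresholds come from the cache until max_age expires.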

so, what is to be done?
basically, we have to teach nagios to open a socket (or should it be some other IPC mechanism? maybe a message queue? I'm still unsure),
send a request packet over it and settle down waiting for a reply.
the daemon on the other side could be threaded (i think i'd write mine this way), but it doesn't in fact matter.
with a socket we could even go as far as running this daemon on a remote machine,
but the benefit of this is unclear to me.
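
to make the request/reply exchange concrete, here is a small sketch with a threaded daemon on a local TCP socket. the one-line "CHECK <host>" protocol, the handler and the host name are invented for illustration; nothing like this exists in nagios today:

```python
import socket
import socketserver
import threading

def handle_request(line):
    """Hypothetical check logic: 'CHECK <host>' -> '<status> <text>'."""
    parts = line.split()
    if len(parts) == 2 and parts[0] == "CHECK":
        return f"0 DISK OK - {parts[1]}"
    return "3 UNKNOWN - bad request"

class CheckHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request line in, one reply line out, then the connection closes.
        line = self.rfile.readline().decode().strip()
        self.wfile.write((handle_request(line) + "\n").encode())

def query(address, host, timeout=10.0):
    """What the nagios side would do: send a request, wait for the reply."""
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(f"CHECK {host}\n".encode())
        reply = sock.makefile("r").readline().strip()
    if not reply:
        return 3, "no reply from check daemon"
    status, _, text = reply.partition(" ")
    return int(status), text

def demo():
    # Port 0 lets the OS pick a free port; a real daemon would use a fixed one.
    with socketserver.ThreadingTCPServer(("127.0.0.1", 0), CheckHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            return query(server.server_address, "db1.example.com")
        finally:
            server.shutdown()

if __name__ == "__main__":
    print(demo())
```

the timeout on the client side matters: nagios must not hang forever if the daemon dies, so a missing reply should map to UNKNOWN.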

that is it. what do you think?

--
 Best regards,
Deomid Ryabkov
UNIX Systems Administrator
RosBusinessConsulting | http://www.rbc.ru/
E-mail: rojer at rbc.ru  | ICQ: 8025844