<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> <meta name="Generator" content="Microsoft Exchange Server">  <style></style> </head> <body> <font face="Calibri, sans-serif" size="2"> <div>A couple of days ago, I ran into a problem I’ve never seen before. We run a single large Nagios instance with mostly very heterogeneous checks and host types. One particular group of Windows hosts, however, is quite uniform, and these hosts, like most of our other checks, rely on templates. I needed to add 10 more hosts of this type. Typically all I have to do is define the hosts; the service checks follow automatically, because the host templates place the hosts in a hostgroup that carries all the relevant checks.</div> <div> </div> <div>I added maybe 5 of these new hosts, ran the pre-flight check, and restarted. After the restart I noticed that our failing service checks (across all services) had gone from around 260 to over 4K. All of the new failures were on hosts of this same type (the Windows application servers mentioned above, which the new hosts also belong to), and they all reported the same failure condition:</div> <div> </div> <div style="padding-left: 36pt; ">(Return code of 127 is out of bounds - plugin may be missing)</div> <div style="padding-left: 36pt; "> </div> <div>Ordinarily this would indicate a client-side issue, but there isn’t one. I can confirm that by running check_nrpe manually against any of these hosts. 
I could imagine a typo causing this, particularly on existing hosts that had not been touched, but I double-checked and did not find one (I was only adding host definitions to this group, nothing else).</div> <div> </div> <div>I cloned this environment into a non-production instance that is identical to the production Nagios instance except for a slightly newer version of Merlin in the backend (1.1.14 on non-prod, 1.1.13-something on production); both use the same Nagios 3.3.1 plus downtime locking patches. I was able to reproduce the situation there. After a couple of days of trial and error I still have not completely isolated the issue, but I’ve determined that:</div> <div> </div> <ul style="margin-top: 0pt; margin-bottom: 0pt; margin-left: 36pt; "> <li>it has nothing to do with the mk-livestatus module (I turned it off and back on with no change), though the module has been very helpful in figuring out which of the 13K+ services and 1200+ hosts are impacted</li><li>it doesn’t seem to be about adding arbitrary hosts and services; I can add others and this doesn’t happen</li><li>the host definition uses a template that puts the host in a hostgroup. Those hostgroups are then used in service definitions (12-15 services, depending on which group). I had thought the problem might be the hostgroup_name line of a service definition expanding to too many hosts, so I split each service definition in two, one per production hostgroup, instead of combining them. That made no difference.</li><li>the service templates used by these hosts’ service definitions all add the services to a common servicegroup. My current thinking is that the problem has something to do with this. 
In one test scenario I create a new host, exclude it from the hostgroup definitions, and instead write its service definitions by hand (I know this “one more host” is right on the cusp of the problem). When I add it so that the 4,331<font size="1"><sup>st</sup></font> service joins the servicegroup, the problem starts. If I remove the servicegroup membership from that host’s service definition, all the other hosts’ services recover. By that reasoning, commenting out the servicegroup assignment in the service template these hosts use should also stop the problem, but it doesn’t.</li><li>the only affected services are those on hosts in the hostgroups I’m changing; unrelated hosts and services are unaffected. There are 3 hostgroups: Production Appname Hosts 1, Production Appname Hosts 2, and All Appname Hosts, which is the combination of the other two. All Appname Hosts has around 324 hosts.</li></ul> <div> </div> <div>I’m not really sure what to try at this point. It does seem like I’ve hit some kind of internal limit in Nagios, but I don’t know how to find out anything more about it. If I could isolate this completely to, say, not adding anything more to a single servicegroup, I could avoid that and continue adding hosts as we need them, but so far I have not found such a workaround. 
If there is a limit like this, it would of course be nice for the pre-flight check to tell me that I can’t have more than X members in a servicegroup, or something along those lines.</div> <div> </div> <div>Other info:</div> <div> </div> <div style="padding-left: 36pt; ">Nagios version: Nagios 3.3.1 with locking patches</div> <div style="padding-left: 36pt; ">Merlin backend: 1.1.13+ (production), 1.1.14 (test)</div> <div style="padding-left: 36pt; ">MK-Livestatus module 1.1.12p6 installed (uninstalling it makes no difference)</div> <div style="padding-left: 36pt; ">OS: SLES 11.1 Linux, 64-bit</div> <div style="padding-left: 36pt; ">Memory: 12GB</div> <div style="padding-left: 36pt; ">CPU: 2x 2.4GHz quad-core Xeon</div> <div> </div> <div>What can I do?</div> <div> </div> <div>Thanks</div> <div> </div> <div>Mark</div> <div> </div> </font> </body> </html>