Bug report: downtimes beyond 2038 cause event queue errors

Andreas Ericsson ae at op5.se
Tue Apr 9 20:35:04 CEST 2013


On 04/08/2013 02:12 PM, Ton Voon wrote:
>
> On 4 Apr 2013, at 22:55, Andreas Ericsson wrote:
>>> This fails on CentOS 5 64bit, though appears to work on Debian Squeeze 32bit, so it maybe a 64 bit only issue.
>>>
>>> We think this is an issue when the event is scheduled via squeue_add(). We've managed to get the test-squeue to fail by changing the time value to be greater than 2038 with the following:
>>>
>>> Index: test-squeue.c
>>> ===================================================================
>>> --- test-squeue.c	(revision 2716)
>>> +++ test-squeue.c	(working copy)
>>> @@ -116,7 +116,7 @@
>>>   	sq_test_random(sq);
>>>   	t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>>>
>>> -	t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
>>> +	t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>>>   	t(squeue_size(sq) == 1);
>>>   	t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>>>   	t(squeue_size(sq) == 2);
>>>
>>> This gives the test result of:
>>>
>>> ### squeue tests
>>>    FAIL max <= *d @test-squeue.c:86
>>>    FAIL x == &b @test-squeue.c:133
>>>    FAIL x->id == b.id @test-squeue.c:134
>>>    FAIL x == &c @test-squeue.c:141
>>> about to fail pretty fucking hard...
>>> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>>>    FAIL x == &b @test-squeue.c:152
>>>    FAIL x->id == b.id @test-squeue.c:153
>>>    FAIL x == &b @test-squeue.c:160
>>>    FAIL x->id == b.id @test-squeue.c:161
>>>    FAIL x == &c @test-squeue.c:166
>>>    FAIL x->id == c.id @test-squeue.c:167
>>> Test results: 390637 passed, 10 failed
>>>
>>> Changing to a factor of 1.1 instead of 2 passes:
>>>
>>
>> I'm not surprised. 1.1 would mean it's still within the unix timeframe.
>>
>> What's the size of time_t, long and struct timeval on systems where it
>> fails?
>> What's the sizes on systems where it succeeds?
>
> With the recreation steps, Nagios 4 works fine on rhel5 32bit, but fails on rhel5 64bit.
>
> sizes.c:
>
> #include <string.h>
> #include <stdio.h>
> #include <assert.h>
> #include <sys/types.h>
> #include <signal.h>
> #include <unistd.h>
> #include <sys/time.h>
> #include "pqueue.h"
>
> int main(int argc, char **argv)
> {
>      struct timeval tv;
>
>      printf("long = %d\n", sizeof(long));
>      printf("time_t = %d\n", sizeof(time_t));
>      printf("tv = %d\n", sizeof(tv));
>      printf("pqueue_pri_t = %d\n", sizeof(pqueue_pri_t));
>      return 0;
>
> }
>
> RHEL5 32 bit:
> long = 4
> time_t = 4
> tv = 8
> pqueue_pri_t = 8
>
>
> RHEL5 64 bit:
> long = 8
> time_t = 8
> tv = 16
> pqueue_pri_t = 8
>

Ah. Obviously. We're truncating 16 bytes down to 8 for priority,
checking, but retain the 16 bytes to check if it's time to actually
run the event yet, so we hang in the event execution loop when
we come across such events.

I've committed a fix which will allow events to be scheduled
until year 141166 for 64-bit systems (43 bits). It's not
possible to extend the time for 32-bit systems, so if you think
you managed to schedule things beyond year 2038, you were wrong.

That last point is sort of important though. It's still possible
to schedule things that appear to be far into the future that
actually run straight away. We can't do anything about that, because
in lib/squeue we can't differ between "far in the past" or "far into
the future", since the value is already wrapped when we get it there.

That means it's considered buggy user-behaviour to try to schedule
things beyond year 2038 on 32-bit systems. We won't support it, ever,
because we can't. It's also considered a buggy user-behaviour to try
to schedule things beyond year 141166 on 64-bit systems. Both actions
will result in "undefined behaviour", which means we'll run the event
whenever the truncated value happens to suggest it should be run (so
it's actually pretty well-defined behaviour but will be hard for users
to verify).

We will not, however, stop running events anymore.

>>
>> Well, modifying the evt_compute_pri() algorithm to discard
>> everything but the 21 least significant bits of the tv->tv_usec
>> would allow us to use 43 bits for the seconds value. That would
>> land us somewhere in the year 141234 before we run out of seconds.
>> It's not a real fix though, since we could live with discarding
>> events that are patently absurd, but blocking the entire scheduler
>> because we get a bogus date is just plain wrong.
>
> I've changed the code so it now looks like this:
>
> static pqueue_pri_t evt_compute_pri(struct timeval *tv)
> {
>          pqueue_pri_t ret;
>
>          /* keep weird compilers on 32-bit systems from doing wrong */
>          if(sizeof(pqueue_pri_t) < 8) {
>                  ret = tv->tv_sec;
>                  ret += !!tv->tv_usec;
>          } else {
>                  ret = (pqueue_pri_t) tv->tv_sec;
>                  ret <<= 43;
>                  ret |= (tv->tv_usec & 0x1FFFFF);
>          }
>
>          return ret;
> }
>
> For the same recreation steps, the event queue is now working properly.
>

No it's not. you just removed the 11 most significant bits of the
priority, so what you think is scheduled 30 years into the future is
in reality scheduled to run sometime within the next 24 days. The fix
I proposed was to change it to

   ret <<= 21;
   ret |= tv->tv_usec & ((1 << 21) - 1);

Although the second line is an unnecessary change, since that's
what it effectually means already, it's easier to read that way
and will prevent others from making the same mistake you just
did.

> The changes I made to test-squeue.c to change the multiplication factor now works up to a factor of 1,000,000 on a 64 bit system. These tests fail on 32 bit, but that's to be expected since the time_t part is 32 bit.
>
> So 43 bits for seconds + 21 bits for usec seem fine.

Except you botched the implementation, so you turned that entire
statement around. It's 21 bits for seconds with your patch. Just
enough for 24 days.

>
>> Besides, with 43 bits for the seconds we could still get too
>> large a number for us to handle and we'd still be back at square 1.
>
> I notice that in pqueue.h that pqueue_pri_t is changed from a double to unsigned long long:
>
> /*
>   * Altered for Nagios by Andreas Ericsson <ae at op5.se> with the excplicit
>   * consent of Volkan Yazici <volkan.yazici at gmail.com>. Many thanks.
>   * Changed as follows:
>   *
>   * - pqueue_pri_t is an unsigned long long instead of a double
>   *   ull comparisons are 107 times faster than double comparisons
>   *   on my 64-bit laptop
>   */
>
> Would it be better to leave it as a double, so that all values will work properly, and take the performance hit?
>

Definitely not. The event queue is the heart of Nagios. Making it
107 times slower so we can support scheduling checks to see exactly
when the sun explodes is just plain stupid.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter




More information about the Developers mailing list