Bug report: downtimes beyond 2038 cause event queue errors

Ton Voon ton.voon at opsview.com
Mon Apr 8 14:12:20 CEST 2013


On 4 Apr 2013, at 22:55, Andreas Ericsson wrote:
>> This fails on CentOS 5 64-bit, though it appears to work on Debian Squeeze 32-bit, so it may be a 64-bit-only issue.
>> 
>> We think this is an issue when the event is scheduled via squeue_add(). We've managed to get the test-squeue to fail by changing the time value to be greater than 2038 with the following:
>> 
>> Index: test-squeue.c
>> ===================================================================
>> --- test-squeue.c	(revision 2716)
>> +++ test-squeue.c	(working copy)
>> @@ -116,7 +116,7 @@
>>  	sq_test_random(sq);
>>  	t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>> 
>> -	t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
>> +	t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>>  	t(squeue_size(sq) == 1);
>>  	t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>>  	t(squeue_size(sq) == 2);
>> 
>> This gives the test result of:
>> 
>> ### squeue tests
>>   FAIL max <= *d @test-squeue.c:86
>>   FAIL x == &b @test-squeue.c:133
>>   FAIL x->id == b.id @test-squeue.c:134
>>   FAIL x == &c @test-squeue.c:141
>> about to fail pretty fucking hard...
>> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>>   FAIL x == &b @test-squeue.c:152
>>   FAIL x->id == b.id @test-squeue.c:153
>>   FAIL x == &b @test-squeue.c:160
>>   FAIL x->id == b.id @test-squeue.c:161
>>   FAIL x == &c @test-squeue.c:166
>>   FAIL x->id == c.id @test-squeue.c:167
>> Test results: 390637 passed, 10 failed
>> 
>> Changing to a factor of 1.1 instead of 2 passes:
>> 
> 
> I'm not surprised. 1.1 would mean it's still within the unix timeframe.
> 
> What are the sizes of time_t, long and struct timeval on systems where
> it fails?
> What are the sizes on systems where it succeeds?

With the recreation steps, Nagios 4 works fine on RHEL5 32-bit, but fails on RHEL5 64-bit.

sizes.c:

#include <string.h>
#include <stdio.h>
#include <assert.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include "pqueue.h"

int main(int argc, char **argv)
{
    struct timeval tv;

    /* sizeof yields a size_t, so print it with %zu rather than %d */
    printf("long = %zu\n", sizeof(long));
    printf("time_t = %zu\n", sizeof(time_t));
    printf("tv = %zu\n", sizeof(tv));
    printf("pqueue_pri_t = %zu\n", sizeof(pqueue_pri_t));
    return 0;

}

RHEL5 32 bit:
long = 4
time_t = 4
tv = 8
pqueue_pri_t = 8


RHEL5 64 bit:
long = 8
time_t = 8
tv = 16
pqueue_pri_t = 8

> Does time_t differ in signedness on them?

Not sure how to check this.
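
One common way to check, offered as a minimal sketch rather than anything taken from the Nagios tree: cast -1 to time_t and see whether the result still compares below zero. The helper names time_t_is_signed() and print_time_t_info() are made up for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Returns 1 if time_t is a signed type on this platform, 0 otherwise.
 * Converting -1 to an unsigned type wraps to that type's maximum value,
 * which is greater than zero; in a signed type it stays negative. */
static int time_t_is_signed(void)
{
    return (time_t)-1 < (time_t)0;
}

/* Hypothetical helper: prints the size and signedness of time_t. */
static void print_time_t_info(void)
{
    printf("time_t = %zu bytes, %s\n", sizeof(time_t),
           time_t_is_signed() ? "signed" : "unsigned");
}
```

Calling print_time_t_info() from main() in sizes.c would add the answer to the output listed above.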

> I think a runtime check based on those sizes should work just fine, and
> also be optimized away so we don't actually have to pay for it, but I'm
> curious to see where it actually goes wrong. If it's before we get to
> see the number in squeue.c we're pretty much fscked, as the only option
> then is a macro which does voodoo-casting so the squeue api sees the
> right number.
> 
>> ### squeue tests
>> Test results: 390647 passed, 0 failed
>> 
>> This worked in Nagios 3, so we're guessing that the change to use the squeue library for events is probably where this limitation has come in.
>> 
>> Any thoughts?
>> 
> 
> Well, modifying the evt_compute_pri() algorithm to discard
> everything but the 21 least significant bits of the tv->tv_usec
> would allow us to use 43 bits for the seconds value. That would
> land us somewhere in the year 141234 before we run out of seconds.
> It's not a real fix though, since we could live with discarding
> events that are patently absurd, but blocking the entire scheduler
> because we get a bogus date is just plain wrong.

I've changed the code so it now looks like this:

static pqueue_pri_t evt_compute_pri(struct timeval *tv)
{
        pqueue_pri_t ret;

        /* keep weird compilers on 32-bit systems from doing wrong */
        if(sizeof(pqueue_pri_t) < 8) {
                ret = tv->tv_sec;
                ret += !!tv->tv_usec;
        } else {
                /* 43 bits of seconds above 21 bits of microseconds:
                 * shift by the width of the usec field */
                ret = (pqueue_pri_t) tv->tv_sec;
                ret <<= 21;
                ret |= (tv->tv_usec & 0x1FFFFF);
        }

        return ret;
}

For the same recreation steps, the event queue is now working properly. 

With the change I made to test-squeue.c, the tests now pass with multiplication factors up to 1,000,000 on a 64-bit system. They fail on 32-bit, but that's to be expected since time_t is only 32 bits there.

So 43 bits for seconds + 21 bits for usec seems fine.
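
To make the 43 + 21 layout concrete, here is a small sketch of the packing on its own, outside squeue.c. It assumes a 64-bit unsigned priority. The names pack_pri(), pri_sec() and pri_usec() are hypothetical and not part of the squeue API. Note that in this layout the seconds are shifted left by the width of the microsecond field (21), which is what leaves 43 bits for them.

```c
#include <stdint.h>

#define USEC_BITS 21                               /* width of the low field */
#define USEC_MASK ((UINT64_C(1) << USEC_BITS) - 1) /* 0x1FFFFF */

/* Hypothetical helper: seconds in the upper 43 bits, microseconds in
 * the lower 21. Seconds values wider than 43 bits lose their top bits
 * in the shift, so absurdly large timestamps are still not handled. */
static uint64_t pack_pri(uint64_t sec, uint64_t usec)
{
    return (sec << USEC_BITS) | (usec & USEC_MASK);
}

/* Recover the two fields from a packed priority. */
static uint64_t pri_sec(uint64_t pri)  { return pri >> USEC_BITS; }
static uint64_t pri_usec(uint64_t pri) { return pri & USEC_MASK; }
```

Because the seconds occupy the most significant bits, comparing two packed values as plain integers orders them by seconds first and microseconds second, which is what a priority queue keyed on time needs.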

> Besides, with 43 bits for the seconds we could still get too
> large a number for us to handle and we'd still be back at square 1.

I notice that in pqueue.h, pqueue_pri_t was changed from a double to an unsigned long long:

/*
 * Altered for Nagios by Andreas Ericsson <ae at op5.se> with the explicit
 * consent of Volkan Yazici <volkan.yazici at gmail.com>. Many thanks.
 * Changed as follows:
 * 
 * - pqueue_pri_t is an unsigned long long instead of a double
 *   ull comparisons are 107 times faster than double comparisons
 *   on my 64-bit laptop
 */

Would it be better to leave it as a double, so that all values will work properly, and take the performance hit?
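
For what it's worth, a rough sketch of what a double priority would trade away, assuming IEEE 754 doubles (53-bit significand). The function pri_as_double() is hypothetical; I have not checked how the original pqueue code or Nagios 3 computed priorities.

```c
#include <stdint.h>

/* Hypothetical helper: express a timestamp as fractional seconds.
 * A double never wraps for large seconds values, but the larger the
 * seconds part, the fewer significand bits remain for the fraction. */
static double pri_as_double(int64_t sec, long usec)
{
    return (double)sec + (double)usec / 1e6;
}
```

With seconds around 2.7 billion (roughly time(NULL)*2 today), two timestamps one microsecond apart still compare as different. With seconds at 2^42, the spacing between adjacent doubles is about a millisecond, so timestamps one microsecond apart collapse to the same priority.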

Ton

