Bug report: downtimes beyond 2038 cause event queue errors

Anton Löfgren alofgren at op5.com
Fri Apr 5 09:41:56 CEST 2013


For the record, the following patch triggers the same failures on my
x86_64 Arch installation (where time_t is normally a quad word).

Subject: [PATCH] test-squeue: trigger y2038 bug

Signed-off-by: Anton Lofgren <alofgren at op5.com>
---
 lib/test-squeue.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/test-squeue.c b/lib/test-squeue.c
index 556952d..faf0b5d 100644
--- a/lib/test-squeue.c
+++ b/lib/test-squeue.c
@@ -11,6 +11,7 @@
 #include <sys/time.h>
 #include "squeue.c"
 #include "t-utils.h"
+#include <stdint.h>

 static void squeue_foreach(squeue_t *q, int (*walker)(squeue_event *, void *), void *arg)
 {
@@ -116,7 +117,7 @@ int main(int argc, char **argv)
        sq_test_random(sq);
        t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");

-       t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
+       t((a.evt = squeue_add(sq, (int32_t)(time(NULL) * 2), &a)) != NULL);
        t(squeue_size(sq) == 1);
        t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
        t(squeue_size(sq) == 2);
-- 
1.8.2




test-squeue:
### squeue tests
  FAIL max <= *d @test-squeue.c:87
  FAIL x == &b @test-squeue.c:134
  FAIL x->id == b.id @test-squeue.c:135
  FAIL x == &c @test-squeue.c:142
about to fail pretty fucking hard...
ea: 0x7fffe6dd1e50; &b: 0x7fffe6dd1e60; &c: 0x7fffe6dd1e70; ed: 0x7fffe6dd1e80; x: 0x7fffe6d83de0
  FAIL x == &b @test-squeue.c:153
  FAIL x->id == b.id @test-squeue.c:154
  FAIL x == &b @test-squeue.c:161
  FAIL x->id == b.id @test-squeue.c:162
  FAIL x == &c @test-squeue.c:167
  FAIL x->id == c.id @test-squeue.c:168
Test results: 390637 passed, 10 failed
make[1]: *** [test] Error 1


/Anton

On Thu, Apr 4, 2013 at 11:55 PM, Andreas Ericsson <ae at op5.se> wrote:

> On 04/04/2013 06:32 PM, Ton Voon wrote:
> > Hi!
> >
> > We've come across a problem when upgrading from Nagios 3 to Nagios 4, and
> > we can't work out where the fix should go. It occurs when an event is
> > scheduled in the future beyond 2038.
> >
>
> Why on earth would you want to schedule something to end beyond 2038?
> It sounds like you're using a patch on a workaround for something that
> was the wrong solution in the first place.
>
> > Recreation steps:
> >    * Set a downtime on a service to end next day
> >    * Stop Nagios
> >    * Edit the retention.dat so that the end_date=4514791088 (some other
> >      values seem to work)
> >    * Start Nagios
> >
> > When Nagios starts, it will not run any scheduled events in the event
> > queue.
> >
>
> Ouch. That's pretty bad.
>
> > This fails on CentOS 5 64bit, though it appears to work on Debian Squeeze
> > 32bit, so it may be a 64-bit-only issue.
> >
> > We think this is an issue when the event is scheduled via squeue_add().
> > We've managed to get test-squeue to fail by changing the time value to
> > be later than 2038 with the following:
> >
> > Index: test-squeue.c
> > ===================================================================
> > --- test-squeue.c     (revision 2716)
> > +++ test-squeue.c     (working copy)
> > @@ -116,7 +116,7 @@
> >       sq_test_random(sq);
> >       t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
> >
> > -     t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
> > +     t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
> >       t(squeue_size(sq) == 1);
> >       t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
> >       t(squeue_size(sq) == 2);
> >
> > This gives the test result of:
> >
> > ### squeue tests
> >    FAIL max <= *d @test-squeue.c:86
> >    FAIL x == &b @test-squeue.c:133
> >    FAIL x->id == b.id @test-squeue.c:134
> >    FAIL x == &c @test-squeue.c:141
> > about to fail pretty fucking hard...
> > ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
> >    FAIL x == &b @test-squeue.c:152
> >    FAIL x->id == b.id @test-squeue.c:153
> >    FAIL x == &b @test-squeue.c:160
> >    FAIL x->id == b.id @test-squeue.c:161
> >    FAIL x == &c @test-squeue.c:166
> >    FAIL x->id == c.id @test-squeue.c:167
> > Test results: 390637 passed, 10 failed
> >
> > Changing to a factor of 1.1 instead of 2 passes:
> >
>
> I'm not surprised. 1.1 would mean it's still within the unix timeframe.
>
> What's the size of time_t, long and struct timeval on systems where it
> fails?
> What are the sizes on systems where it succeeds?
> Does time_t differ in signedness on them?
>
> I think a runtime check based on those sizes should work just fine, and
> also be optimized away so we don't actually have to pay for it, but I'm
> curious to see where it actually goes wrong. If it's before we get to
> see the number in squeue.c we're pretty much fscked, as the only option
> then is a macro which does voodoo-casting so the squeue api sees the
> right number.
>
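
The size and signedness questions above could be answered on each system
with something like the following. This is only a minimal sketch:
time_t_is_signed() and report_time_types() are hypothetical helper names
made up for illustration and are not part of Nagios or its squeue library.

```c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

/* Returns 1 if time_t is signed: (time_t)-1 is below zero only for signed types. */
static int time_t_is_signed(void)
{
	return (time_t)-1 < (time_t)0;
}

/*
 * Hypothetical helper: print the sizes of the types asked about above.
 * Returns the number of characters written, or a negative value on error.
 */
static int report_time_types(FILE *out)
{
	return fprintf(out,
		"time_t: %zu bytes (%s), long: %zu bytes, struct timeval: %zu bytes\n",
		sizeof(time_t),
		time_t_is_signed() ? "signed" : "unsigned",
		sizeof(long),
		sizeof(struct timeval));
}
```

Calling report_time_types(stdout) on a host where the test fails and on one
where it passes, then comparing the two lines, would show whether the types
actually differ between them.
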
> > ### squeue tests
> > Test results: 390647 passed, 0 failed
> >
> > This worked in Nagios 3, so we're guessing that the change to use the
> > squeue library for events is probably where this limitation has come in.
> >
> > Any thoughts?
> >
>
> Well, modifying the evt_compute_pri() algorithm to discard
> everything but the 21 least significant bits of the tv->tv_usec
> would allow us to use 43 bits for the seconds value. That would
> land us somewhere in the year 141234 before we run out of seconds.
> It's not a real fix though, since we could live with discarding
> events that are patently absurd, but blocking the entire scheduler
> because we get a bogus date is just plain wrong.
>
> Besides, with 43 bits for the seconds we could still get too
> large a number for us to handle and we'd still be back at square 1.
>
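
The bit split described above (21 bits for tv_usec, leaving 43 for the
seconds in a 64-bit priority) might look roughly like this. It is a sketch
under assumptions, not the actual evt_compute_pri() from squeue.c, whose
current layout is not shown in this thread. pack_pri() and the USEC_BITS,
SEC_BITS, USEC_MASK and SEC_MAX macros are hypothetical names. The range
check reflects the point that an absurd date should be rejected rather than
allowed to corrupt the ordering.

```c
#include <stdint.h>
#include <sys/time.h>

#define USEC_BITS 21                                /* hypothetical: bits kept for tv_usec */
#define SEC_BITS  (64 - USEC_BITS)                  /* 43 bits left for tv_sec */
#define USEC_MASK ((UINT64_C(1) << USEC_BITS) - 1)
#define SEC_MAX   ((UINT64_C(1) << SEC_BITS) - 1)

/*
 * Pack tv into *pri so that earlier times compare lower.
 * Returns 0 on success, or -1 if tv is negative, has an out-of-range
 * tv_usec, or has a tv_sec that does not fit in SEC_BITS bits.
 */
static int pack_pri(const struct timeval *tv, uint64_t *pri)
{
	if (tv->tv_sec < 0 || tv->tv_usec < 0 || tv->tv_usec >= 1000000)
		return -1;
	if ((uint64_t)tv->tv_sec > SEC_MAX)
		return -1;

	/* Seconds go in the high bits, microseconds in the low 21 bits. */
	*pri = ((uint64_t)tv->tv_sec << USEC_BITS)
	     | ((uint64_t)tv->tv_usec & USEC_MASK);
	return 0;
}
```

Since tv_usec is always below 1000000, which is less than 2^21, masking it
to 21 bits loses nothing, and two packed values compare in the same order as
the times they came from. A caller that gets -1 back could drop that one
event instead of stalling the whole queue.
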
> --
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>
> Considering the successes of the wars on alcohol, poverty, drugs and
> terror, I think we should give some serious thought to declaring war
> on peace.
>
>
> ------------------------------------------------------------------------------
> Minimize network downtime and maximize team effectiveness.
> Reduce network management and security costs.Learn how to hire
> the most talented Cisco Certified professionals. Visit the
> Employer Resources Portal
> http://www.cisco.com/web/learning/employer_resources/index.html
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>