JVM issue: concurrency is affected by changing the date of the system! [part 3]

I have been asked further information about the matter and for that reason I am pushing a bit of more code here. It’s C++, so be aware of it! For the records, we are looking at the sources of the hotspot JVM, you can find the source here:

http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/aed585cafc0d/src/os/linux/vm/os_linux.cpp

Let’s have a look at the park() function of PlatformEvent, which is used within all synchronization primitives of the JVM:

int os::PlatformEvent::park(jlong millis) {
   [...]
 
   struct timespec abst;
   compute_abstime(&abst, millis);
   [...]

   while (_Event < 0) {
     status = os::Linux::safe_cond_timedwait(_cond, _mutex, &abst);
     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
       pthread_cond_destroy (_cond);
       pthread_cond_init (_cond, NULL) ;
     }
     assert_status(status == 0 || status == EINTR ||
                   status == ETIME || status == ETIMEDOUT,
                   status, "cond_timedwait");
     if (!FilterSpuriousWakeups) break ;                 // previous semantics
     if (status == ETIME || status == ETIMEDOUT) break ;
     // We consume and ignore EINTR and spurious wakeups.
   }

Please look at the line in bold, where the end time to wait is computed: if you open that function (line 5480) you will notice that it’s calculating an absolute time. based on the wall clock

   [...]
   static struct timespec* compute_abstime(timespec* abstime, jlong millis) {

      if (millis < 0)  millis = 0;

      struct timeval now;
      int status = gettimeofday(&now, NULL);
      [...]

So what will happen is that the park function will be waiting on an absolute time based on a wall clock, hence will fail miserably if the wall clock is changed.

The simplest fix, without changing too much code, would be to use the CLOCK_MONOTONIC (or CLOCK_MONOTONIC_RAW, even better) to compute the absolute time ( clock_gettime(CLOCK_MONOTONIC, &ts) ) and also to check it the same way in the main loop (you can associate any available clock with a pthread_cond_timewait)

Then, if we really want to stay on the safe side, we should avoid using absolute delays and use relative delays, as POSIX specs explicitly guarantees that threads waiting on a relative time are not affected to changes to the underling clock, while when using absolute delays the situation is historically “fuzzy”.

Is that complex? I does not look so, at least looking at the code (I will try to patch it myself for sure) but I surely do not grasp the complexity of the whole hotspot, so I may fail miserably. It also have to be noted that my C++ skills are kind of dated :)

See also
The full saga, all the articles I published on the matter:

JVM issue: concurrency is affected by changing the date of the system! [part 2]

Based on a lot of questions I received in various mailing lists related to the previous post and in order to make the issue simpler and clearer I decided to go back to a binary deliverable (code) that shows the problem, hope this helps!

This is my PreciousPool class, that handles Precious resources:

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PreciousPool {

    public static class Precious {
        private final int id;

        private Precious() {
            this.id = 100+(int)(Math.random()*900.0);
        }

        public String toString() {
            return "Precious n."+id;
        }
    }

    private final Lock lock;
    private final Condition ready;
    private final long timeoutInMillis;

    private final List preciousLended;
    private final List preciousAvailable;

    public PreciousPool(int size, long timeoutInSeconds) {
        this.lock = new ReentrantLock();
        this.ready = lock.newCondition();

        this.timeoutInMillis = 1000L*timeoutInSeconds;
        this.preciousLended =  new ArrayList();
        this.preciousAvailable = new ArrayList();

        for (int i = 0; i < size; i++) {
            preciousAvailable.add(new Precious());
        }
    }

    public Precious obtain()  {
        lock.lock();
        try {
            // if no precious are available we wait for the specified timeout (releasing the lock so that others can try)
            if (preciousAvailable.size() == 0) {
                try {
                    ready.await(timeoutInMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Somebody interrupted me!", e);
                }
            }

            // if a precious is available we unload it and return to the caller, otherwise null
            if (preciousAvailable.size() > 0) {
                Precious value = preciousAvailable.remove(0);
                preciousLended.add(value);
                return value;
            } else {
                return null;
            }
        } finally {
            lock.unlock();
        }
    }

    public void release(Precious value) {
        lock.lock();
        try {
            if (!preciousLended.remove(value))
                throw new RuntimeException("Element "+value+" was not lended!");

            // if a precious is returned we put it back and signal to anybody waiting
            preciousAvailable.add(value);
            ready.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String args[]) {
        final int size = 3;
        final PreciousPool pool = new PreciousPool(size, 5);

        // let's exhaust the pool
        for (int i=0; i<size; i++)
            dump(pool.obtain());

        // and as we are stubborn we continuosly ask for a new one
        while(true) {
            dump(pool.obtain());
        }
    }

    private static void dump(Precious precious) {
        if (precious == null)
            log("I did not get my precious :(");
        else
            log("I did get my precious! "+precious);
    }

    private static void log(String message) {
        final String now = new SimpleDateFormat("HH:mm:ss:SSSS ").format(new Date());
        System.out.println(now + message);
    }
}

So, the main is a single thread (no need for multithreading here, let’s keep it simple), that first exhaust the whole pool and then keep asking, without success, for a resource. Stubborn guy, I say, but it happens. If you run this program everything works as expected: you are greeted by a three successful Precious and then an endless list of failures, that it continuously grow. All good :)

02:34:40:0061 I did get my precious! Precious n.156
02:34:40:0062 I did get my precious! Precious n.991
02:34:40:0062 I did get my precious! Precious n.953
02:34:45:0064 I did not get my precious!
02:34:50:0065 I did not get my precious!
02:34:55:0066 I did not get my precious!
02:35:00:0067 I did not get my precious!
02:35:05:0068 I did not get my precious!
[...]

But guess what happens when, while the program is running, I change the date of my system back of one hour? Everything stops, it’s simple as that. No prints, nothing, zero, nada. Now, If it wasn’t so late, I would probably wait one hour in order to have my program restored to his normal process, but as a customer I won’t be terribly happy :)

See also
The full saga, all the articles I published on the matter:

JVM issue: concurrency is affected by changing the date of the system!

Executive summary
The implementation of the concurrency primitive LockSupport.parkNanos(), the function that controls *every* concurrency primitive on the JVM, is flawed, and any NTP sync, or system time change backwards, can potentially break it with unexpected results across the board when running a 64bit JVM on Linux 64bit

What we need to do?
This is an old issue, and the bug was declared private. I somehow managed to have the bug reopened to the public, but it’s still a P4, that means that probably won’t be fixed. I think we need to push for a resolution ASAP, be sure that’s in for JDK9, make all the possible effort to make this fix for JDK8 or, at least, to include it in a later patch release. In an ideal world it would be nice to have a patch for JDK7

Why all this urgency?
If a system time change happens then all the threads parked will hang, with unpredictable/corrupted/useless results to the end user. Same applies to Future, Queue, Executor, and any other construct that it’s somehow related to concurrency. This is a big issue for us and for any near time application: please think about trading and betting, where the JVM is largely used. And please do not restrain yourself to the Java language: add Scala and any other JVM-based language to the picture.

All the details (spoiler: tech stuff!)
To be more clear about the issue, the extent of it and the concurrency library, let me introduce this very simple program:

import java.util.concurrent.locks.LockSupport;

public class Main {
  public static void main(String[] args) {
    for (int i=100; i>0; i--) {
      System.out.println(i);
      LockSupport.parkNanos(1000L*1000L*1000L);
    }
    System.out.println("Done!");
  }
}

Run it with a 64bit 1.6+ JVM on 64bit Linux, turn the clock down one hour and wait until the counter stops… magic! I tested this on JDK6, JDK7 and latest JDK8 beta running on various Ubuntu distros. It’s not just a matter of (old?) sleep() and wait() primitives, this issue it affects the whole concurrency library.

To prove that this is fixable, I reimplemented the program above above substituting LockSupport.parkNanos() with a JNI call toclock_nanosleep(CLOCK_MONOTONIC…): works like a charm :(
This is due to the fact that the CPP code is calling the pthread_cond_timedwait() using its default clock (CLOCK_REALTIME) which, unfortunately is affected by settime()/settimeofday() calls (on Linux): for that reason it cannot be used to measure nanoseconds delays, which is what the specification requires. CLOCK_REALTIME is not guaranteed to monotonically count as this is the actual “system time”: each time my system syncs time using a NTP server on the net, the time might jump forward or backward. The correct call (again on Linux) would require to use CLOCK_MONOTONIC as clock id, which are defined by POSIX specs since 2002. (or better CLOCK_MONOTONIC_RAW)

The POSIX spec is infact clear, as it states “…setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock…”: it definitely states “relative”. Having a look at the hotspot code, it appears that the park() is using compute_abstime() (which uses timeofday) and then waits on an absolute period: for that reason it’s influenced by the system clock change. Very wrong.

Next steps?
I am trying to raise the awareness of this issue, basically involving as much people as I can. I will continue to do that and escalate as soon as I get some solid response from Oracle. I am also working on a patch myself.

See also
The full saga, all the articles I published on the matter:

Gitinabox, get your private git!

The long wait is over! GerritForge LLP today announced the availability of the first private beta of Git-in-a-box, download it today!

gitinaboxGit-in-a-box is the revolutionary HTML5 User-Experience to get started using Git on your premises with a “plug & play” Server installation. Take a tour of the features available and check out the howto section. No install, no hassle: just double click on the java executable jar and you will have a fully featured Git backend, fully managed trough a nice web interface, while enjoying http and ssh access to your repositories from your network!

What you can do with it?

  • create and manage your repositories with simple clicks
  • create and manage users and groups with no dependencies
  • control security and access to your code
  • access statistics and browsing
  • enjoy a pleasant user interface on your phone, tablet or computer

Git-in-a-box is currently in private beta till end of March 2013. With all the feedback received and the stabilisation fixed, the final release of Git-in-a-box will be officially available for Download from April 2013.

Check it out at gitinabox.com!

Flash Plugin for Ubuntu 64

Long time since last post, but at least a useful bit of information. On Ubuntu 11.04 64bit if you install the flashplugin-installer, either alone or as part of the metapackage ubuntu-restricted-extras you will get…. surprise, a 32bit flaash plugin! Very annoying, also because it does not work very well.

Just install adobe-flushplugin instead: it will uninstall the old version and you will get a proper 64bit version :)

Git Enterprise is now in private beta!

Git enterprise web site

Git Enterprise private beta is now open! You can request a beta key sending an email at beta2@gitenterprise.com

Git is a fast distributed version control system that can efficently handle small and large projects. It simplifies distributed development and protect investments with its application compatibility features (SVN, CVS). Git Enterprise leverages on Git technical strengths adding advanced enterprise management features. Click here for a full overview of Git Enterprise features!

We are all very excited about this… we can’t wait to get some feedback from you guys, hope we’ll see you soon!