JVM issue: concurrency is affected by changing the date of the system!

Executive summary
The implementation of the concurrency primitive LockSupport.parkNanos(), the function that underpins *every* concurrency primitive on the JVM, is flawed: any NTP sync, or any backward change of the system time, can potentially break it, with unexpected results across the board, when running a 64-bit JVM on 64-bit Linux.

What do we need to do?
This is an old issue, and the bug was declared private. I somehow managed to have the bug reopened to the public, but it’s still a P4, which means that it probably won’t be fixed. I think we need to push for a resolution ASAP: make sure the fix is in for JDK9, and make every possible effort to get it into JDK8 or, at least, into a later patch release. In an ideal world it would be nice to have a patch for JDK7 too.

Why all this urgency?
If a system time change happens then all the threads parked will hang, with unpredictable, corrupted or useless results for the end user. The same applies to Future, Queue, Executor, and any other construct that is somehow related to concurrency. This is a big issue for us and for any near-real-time application: think about trading and betting, where the JVM is widely used. And please do not restrict yourself to the Java language: add Scala and any other JVM-based language to the picture.

All the details (spoiler: tech stuff!)
To make the issue, and its extent across the concurrency library, clearer, let me introduce this very simple program:

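import java.util.concurrent.locks.LockSupport;

public class Main {
  public static void main(String[] args) {
    for (int i=100; i>0; i--) {
      System.out.println(i);  // assumed loop body (truncated in the source): print the counter...
      LockSupport.parkNanos(1000L * 1000L * 1000L);  // ...then park for about one second
    }
  }
}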

Run it with a 64-bit 1.6+ JVM on 64-bit Linux, set the clock back one hour and wait until the counter stops… magic! I tested this on JDK6, JDK7 and the latest JDK8 beta running on various Ubuntu distros. It’s not just a matter of the (old?) sleep() and wait() primitives: this issue affects the whole concurrency library.

To prove that this is fixable, I reimplemented the program above, substituting LockSupport.parkNanos() with a JNI call to clock_nanosleep(CLOCK_MONOTONIC…): works like a charm :(
The bug is due to the fact that the C++ code calls pthread_cond_timedwait() using its default clock (CLOCK_REALTIME) which, unfortunately, is affected by clock_settime()/settimeofday() calls (on Linux): for that reason it cannot be used to measure nanosecond delays, which is what the specification requires. CLOCK_REALTIME is not guaranteed to count monotonically, as it is the actual “system time”: each time my system syncs its time with an NTP server on the net, the time might jump forward or backward. The correct call (again on Linux) would use CLOCK_MONOTONIC as the clock id, which has been defined by the POSIX specs since 2002 (or, better, CLOCK_MONOTONIC_RAW).
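
The same distinction between the two kinds of clock is visible from plain Java. The following is only a minimal sketch (the class and method names are made up for illustration): System.currentTimeMillis() reports wall-clock time and follows any change of the system time, while System.nanoTime() is documented as meaningful only for measuring elapsed time and unrelated to wall-clock time. That the latter is backed by CLOCK_MONOTONIC on Linux is an assumption here, not something the Javadoc promises.

```java
import java.util.concurrent.TimeUnit;

public class ElapsedClocks {
    // Elapsed milliseconds between two System.nanoTime() readings.
    public static long elapsedMillis(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        long wallStart = System.currentTimeMillis(); // wall clock: moves when the date is changed
        long monoStart = System.nanoTime();          // elapsed-time source: not tied to the date

        Thread.sleep(200);

        long wallElapsed = System.currentTimeMillis() - wallStart;
        long monoElapsed = elapsedMillis(monoStart, System.nanoTime());

        // If the system date is changed during the sleep, only wallElapsed
        // is expected to show the jump.
        System.out.println("wall: " + wallElapsed + " ms, monotonic: " + monoElapsed + " ms");
    }
}
```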

The POSIX spec is in fact clear, as it states “…setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock…”: it definitely says “relative”. Looking at the HotSpot code, it appears that park() uses compute_abstime() (which uses gettimeofday()) and then waits until an absolute time: for that reason it is influenced by system clock changes. Very wrong.
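
To see why an absolute deadline taken from the wall clock is the problem, here is the arithmetic as a small sketch. This is not the HotSpot code: the class, the method names and the clock readings are all hypothetical, chosen only to show what happens to a one-second wait when the clock is set back one hour.

```java
public class DeadlineMath {
    // Absolute deadline, computed once before blocking.
    public static long absoluteDeadline(long nowMillis, long timeoutMillis) {
        return nowMillis + timeoutMillis;
    }

    // How long a waiter that compares the deadline against the clock still has to block.
    public static long remaining(long deadlineMillis, long nowMillis) {
        return Math.max(0L, deadlineMillis - nowMillis);
    }

    public static void main(String[] args) {
        long timeout = 1_000L;         // requested relative wait: 1 second
        long wallStart = 10_000_000L;  // hypothetical wall-clock reading
        long monoStart = 500_000L;     // hypothetical monotonic reading
        long jumpBack = 3_600_000L;    // wall clock set back one hour

        long wallDeadline = absoluteDeadline(wallStart, timeout);
        long monoDeadline = absoluteDeadline(monoStart, timeout);

        // Right after the jump the wall clock has moved, the monotonic clock has not.
        long wallRemaining = remaining(wallDeadline, wallStart - jumpBack);
        long monoRemaining = remaining(monoDeadline, monoStart);

        System.out.println("wall-clock waiter blocks for " + wallRemaining + " ms"); // 3601000 ms
        System.out.println("monotonic waiter blocks for " + monoRemaining + " ms");  // 1000 ms
    }
}
```

The waiter that asked for one second ends up blocked for one hour and one second, which matches the counter freezing in the test program above.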

Next steps?
I am trying to raise awareness of this issue, basically involving as many people as I can. I will continue to do that and escalate as soon as I get some solid response from Oracle. I am also working on a patch myself.

See also
The full saga, all the articles I published on the matter:

16 thoughts on “JVM issue: concurrency is affected by changing the date of the system!”

  1. Pingback: JVM issue: concurrency is affected by changing the date of the system! [part 2] | It can't rain forever...
  2. Pingback: Ziemliches fetter Java-Problem…. - Java Blog | Javainsel-Blog
  3. Pingback: JVM issue: concurrency is affected by changing the date of the system! [part 3] | It can't rain forever...
  4. “If a system time change happens then all the threads parked will hang”

    This is simply untrue. The impact of this bug is limited to specific 64-bit Linux kernels; where the system clock is moved backwards by a significant amount; and where the timed-wait primitive (Thread.sleep, Object.wait(millis), LockSupport.parkNanos) is used for directly scheduling an event. Most uses of timed-waits are defensive, for error recovery, and the timeout rarely comes into play because the expected notifications occur.
    If you are impacted by this then the impact is severe. But most code is not impacted; and most systems don’t introduce large backward time jumps (which as has been noted elsewhere will potentially break a lot of things beyond the JVM).
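
A small sketch of the “defensive” timed wait described in this comment, for contrast with the test program in the post, which uses the timeout itself to schedule each tick. The class and method names are hypothetical. The timeout here is only a safety net: when the expected notification arrives, the call returns without the timeout ever coming into play.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class DefensiveTimeout {
    // Waits up to 10 seconds for a message, but returns as soon as one is available.
    // Returns null on timeout or interruption.
    public static String receive(BlockingQueue<String> queue) {
        try {
            return queue.poll(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.offer("tick");               // the expected notification is already there

        long start = System.nanoTime();
        String msg = receive(queue);       // returns without waiting for the timeout
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println(msg + " received after " + elapsedMs + " ms");
    }
}
```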

    • Hello David, good to see you here. The fact that the bug affects only 64-bit is clearly stated, so thanks, that’s covered :). Regarding the fact that the hanging happens only when the time moves backwards, you are correct, so for the sake of precision I changed the text accordingly. Regarding the “limited impact”, I beg to differ with your analysis, as noted elsewhere, but I will make sure to write another blog post to make this clear. Finally, the fact that other applications can be impacted by time changes does not make this bug less serious.

      Thanks for your contribution!

  5. Is there any workaround available on mentioned systems?

    Or do we need to implement JNI native equivalents to circumvent this bug?

    Thanks for any help!

    • The workaround is to not let large backward jumps in system time occur on your systems. (Note DST changes do not cause jumps in the system time.)
      Also this only affects specific 64-bit Linux distributions, I believe those with glibc 12 and later.

      • Thanks David! Unfortunately, the majority of us who use the JVM in trading systems should be aware of this bug and of how to avoid it at any cost!

      • Answering “X is causing problem Y, how can I work around it?” with “Just don’t make X happen!”: I am really not sure this is a proper workaround.

        As already stated elsewhere, this bug definitely affects Ubuntu 10, Ubuntu 11, Ubuntu 12, and Ubuntu 13, so its impact is not very limited, and these days we do not have a lot of 32-bit systems in the wild.

      • Given you don’t _need_ to continually change the system clock then not changing it is certainly a potential workaround. It may not be a universal workaround but if you can control it then don’t do it.

        Can you confirm which version of glibc is used by Ubuntu 10?

    • Unfortunately it’s not possible to re-implement JNI native equivalents, unless you want to rewrite the whole concurrency system or the impact for you is very limited. What you can do is:
      - create a watchdog process (in Python or any unaffected language) that restarts your JVMs
      - create a watchdog thread in Java that uses JNI calls to detect a change of the clock and, when this happens, changes the system time accordingly

      Any suggestion is welcome, though!
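
As a rough sketch of the detection half of the watchdog idea: instead of the JNI calls mentioned above, this version swaps in System.nanoTime() as the reference clock, assuming it is not affected by changes of the system date. The class name and its API are hypothetical. It covers only the detection; the polling loop and the reaction (restart, or correcting the time) are left out, and a polling loop built on Thread.sleep() could itself be delayed by the bug.

```java
import java.util.concurrent.TimeUnit;

public class ClockJumpDetector {
    private long lastWallMillis;
    private long lastMonoNanos;

    public ClockJumpDetector(long wallMillis, long monoNanos) {
        this.lastWallMillis = wallMillis;
        this.lastMonoNanos = monoNanos;
    }

    // Returns how far the wall clock moved relative to the reference clock since
    // the previous sample, in milliseconds (negative means it jumped backwards).
    public long sample(long wallMillis, long monoNanos) {
        long wallDelta = wallMillis - lastWallMillis;
        long monoDelta = TimeUnit.NANOSECONDS.toMillis(monoNanos - lastMonoNanos);
        lastWallMillis = wallMillis;
        lastMonoNanos = monoNanos;
        return wallDelta - monoDelta;
    }

    public static void main(String[] args) {
        ClockJumpDetector detector =
            new ClockJumpDetector(System.currentTimeMillis(), System.nanoTime());
        // In a real watchdog this call would be repeated periodically.
        long drift = detector.sample(System.currentTimeMillis(), System.nanoTime());
        System.out.println("wall clock moved " + drift + " ms relative to nanoTime()");
    }
}
```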

    • Yes, that’s exactly what’s happening in our client environment. Any backward jump will affect the JVM for the duration of the jump itself: so, for example, if the clock jumps back by 10 seconds, then all the threads waiting on a lock will wait for at least an extra 10 seconds.

      • Correction: All threads doing a *timed* wait for a lock, where the lock is not made available within the 10 seconds, would wait the extra 10 seconds.

        Can anyone point to a technical reference explaining why a virtualized system would need to change the time-of-day clock in the guest to deal with timeslicing?
