REMIX - Online Detection and Repair of Cache Contention for the JVM

Accepted PLDI 2016

Introduction

Abstract

As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have had to resort to highly-invasive runtime program modification. We observe that managed language runtimes, with garbage collection and JIT code compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs.

We present REMIX, a modified version of the Oracle HotSpot8 JVM which can detect cache contention bugs and repair false sharing at runtime. REMIX's detection mechanism leverages recent performance counter improvements on Intel(R) platforms (PEBS), which allow for precise, unobtrusive monitoring of cache contention at the hardware level. REMIX can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand-optimization. REMIX also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. REMIX incurs no statistically-significant performance overhead on other benchmarks that do not exhibit cache contention, making REMIX practical for always-on use.

Requirements

REMIX Requirements:

  • Hardware support for cache coherence PEBS events (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM and MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM). These are available in Intel(R) Haswell, Broadwell and Skylake processors. Unfortunately this means REMIX cannot be tested in virtual machines, which do not have access to the hardware events.
REMIX was tested on single- and dual-socket 8-core Haswell (model 63) processors, but should run on any Intel CPU with model numbers 62, 63, 69, 70,71, 78, 86, 94 (Models of the aforementioned processor families available at the time of writing).
To check the CPU model, run awk '/model\s*:/{print $3}' /proc/cpuinfo
  • Linux kernel version 4.1.0 and higher built for x86_64, with PERF support enabled (default). A precompiled 4.4.1 kernel for Ubuntu 15.04 can be found in the download section.

You can verify your kernel version and architecture with uname -r -m.

Paper

Accepted version of the paper: remix-final.pdf.

Tutorial

Installation

REMIX source code can be found at https://github.com/upenn-acg/REMIX (GPLv2 License).

Precompiled binaries available at remix-0.91-x64-jdk.tar.xz.

tar xvfJ remix-0.91-x64-jdk.tar.xz

REMIX is executed like regular java, with the addition of the -remix command line parameter.

remix/jdk/bin/java -remix ...

or, if added to the PATH:

export PATH=~/path/to/remix/jdk/bin:$PATH
java -remix ...

To compile from sources, download the source version, and then:

git clone https://github.com/upenn-acg/REMIX.git
cd REMIX/jvm-remix/
mkdir build
cd build
../openjdk/configure
make

More information on compiling JDK8 can be found at http://hg.openjdk.java.net/build-infra/jdk8/raw-file/tip/README-builds.html.

Simple Example

Consider the following java file, BaseTest.java:

public final class BaseTest implements Runnable {
    public final static class VolatileLong {
        public volatile long value = 0L;
    }

    public final static int NUM_THREADS = 8; 
    public final static long ITERATIONS = 2000L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static {
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
    }

    public BaseTest(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        final long start = System.nanoTime();
        runTest();
        System.out.println("BaseTest Time " + (System.nanoTime() - start) + "ns");
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) 
            threads[i] = new Thread(new BaseTest(i));

        for (Thread t : threads)
            t.start();

        for (Thread t : threads)
            t.join();
    }

    public void run() {
        long i = ITERATIONS + 1;
        VolatileLong vl = longs[arrayIndex];
        long sum = 0;
        while (0 != --i) {
            vl.value = i;
            sum += vl.value;
        }
    }

}

First we run without REMIX:

>> ~/remix/jdk/bin/java BaseTest
BaseTest Time 366717292370ns

The test took 366.7 seconds.

Now let's compare With REMIX:

>> ~/remix/jdk/bin/java -remix -XX:+REMIXVerbose BaseTest
REMIX: SamplingRatio is 1:10000.
REMIX: System tests passed.
REMIX: relayout phase starting.
REMIX: Considering padding for this=0x7f82d4000e40 klass=LBaseTest$VolatileLong; (direct=1, old=24 0x800060230, backup=0x800060600, parent=(nil))
REMIX: relayout phase done.
BaseTest Time 17621340141ns
REMIX: Transformed 1 klasses, moved 8 instances.

We specified -remix to enable REMIX, and -XX:+REMIXVerbose to get some status updates. As can be seen, running time is now just 17.6 seconds, more than a 20x improvement (for this admittedly artificial case). REMIX also supports a detect-only mode, invoked with -remix:detect. To ensure that the extremely high volume of events for this particular benchmark does not slow execution considerably, we sample just 1 out of 1,000,000 events (default is 1:10000):

>> ~/remix/jdk/bin/java -remix:detect -XX:REMIXSamplingRatio=10000000 BaseTest
BaseTest Time 297897616371ns
REMIX: Processed 1 classes:
REMIX: Hits for class 'BaseTest$VolatileLong':
REMIX: - Hit BaseTest$VolatileLong::value[16]: 1008:0
REMIX: Transformed 0 klasses, moved 0 instances.

We can see that the field VolatileLong.value, at offset 16, was the source of 1008 HITM events. Multiplying this latter value by the sampling rate, we see that the hardware generated approximately a billion events during the program's execution.

We can compare REMIX's performance to that of manual padding, by padding the VolatileLong class with 7 more longs:

public final static class VolatileLong {
        public volatile long p0, p1, p2, p3, p4, p5, p6;
        public volatile long value = 0L;
}

Running this version without REMIX results in a total runtime of 14.6 seconds. The three second delay is mostly the delay before REMIX identifies the false sharing.

REMIX Options

REMIX supports all standard Java8 options, but online repair requires the use of a stop-the-world GC such as the ParNewGC.

REMIX java accepts the following additional options:

java -remix[:online|:detect] - Enable REMIX support for the JVM, with online repair or detection only. Omitting the coloin defaults to online repair
     -XX:+REMIXVerbose - Enable verbose reporting.
     -XX:+REMIXHitVerbose - Enable verbose HITM event reporting.
     -XX:REMIXSamplingRatio=<ratio> - Control REMIX sampling ratio. Default value is 10000, e.g. only one event out of 10,000 is sampled.
     -XX:+REMIXDebug - Enable debug logging.

REMIX changes the default GC to the ParNewGC. REMIX can also be used in conjunction with the serial GC:

 ~/remix/jdk/bin/java -XX:REMIXLevel=3 -XX:+UseSerialGC ...

instead of specifying -remix directly.

Compiling a custom kernel with PERF support

REMIX requires the following settings to be enabled in the kernel .config file (kernel >= 4.1):

CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_CGROUP_PERF=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y

You can check if your current kernel supports these options by looking for them in /boot/config-`uname -r`.

We supply precompiled Linux 4.4.1 binaries for Ubuntu 15.04 : linux-image-4.4.1_4.4.1-10.00.Custom_amd64.deb, /linux-headers-4.4.1_4.4.1-10.00.Custom_amd64.deb.