Why Is Java Not Utilising All My CPU Cores Effectively [duplicate]


Using multiple CPUs helps up to the point you saturate some underlying resource.

In your case, the underlying resource is not the number of CPUs but the number of L1 caches you have. It appears you have two cores, each with its own L1 data cache, and since you are hitting that cache with a volatile write, the L1 caches are your limiting factor here.

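Try accessing the L1 cache less, for example by accumulating in a local variable and writing the volatile field only once per outer iteration: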

    public class Example implements Runnable {
        // using this so the compiler does not optimise the computation away
        volatile int temp;

        void delay(int arg) {
            for (int i = 0; i < arg; i++) {
                // accumulate in a local variable; the volatile field is
                // written only once per outer iteration
                int temp = 0;
                for (int j = 0; j < 1000000; j++) {
                    temp += i + j;
                }
                this.temp += temp;
            }
        }

        int arg;
        int result;

        Example(int arg) {
            this.arg = arg;
        }

        public void run() {
            delay(arg);
            result = 42;
        }

        public static void main(String... ignored) {
            int MAX_THREADS = Integer.getInteger("max.threads", 8);
            long[] times = new long[MAX_THREADS + 1];
            for (int numThreads = MAX_THREADS; numThreads >= 1; numThreads--) {
                long start = System.nanoTime();
                // Start up the threads
                Thread[] threadList = new Thread[numThreads];
                Example[] exampleList = new Example[numThreads];
                for (int i = 0; i < numThreads; i++) {
                    exampleList[i] = new Example(1000);
                    threadList[i] = new Thread(exampleList[i]);
                    threadList[i].start();
                }
                // wait for the threads to finish
                for (int i = 0; i < numThreads; i++) {
                    try {
                        threadList[i].join();
                        System.out.println("Joined with thread, ret=" + exampleList[i].result);
                    } catch (InterruptedException ie) {
                        System.out.println("Caught " + ie);
                    }
                }
                long time = System.nanoTime() - start;
                times[numThreads] = time;
                System.out.printf("%d: %.1f ms%n", numThreads, time / 1e6);
            }
            // elapsed time for each thread count, relative to the single-thread run
            for (int i = 2; i <= MAX_THREADS; i++)
                System.out.printf("%d: %.3f time %n", i, (double) times[i] / times[1]);
        }
    }

On my dual-core, hyperthreaded laptop it prints the following, in the form threads: factor, where the factor is the elapsed time relative to the single-thread run:

    2: 1.093 time
    3: 1.180 time
    4: 1.244 time
    5: 1.759 time
    6: 1.915 time
    7: 2.154 time
    8: 2.412 time

compared with the original test, which produced

    2: 1.092 time
    3: 2.198 time
    4: 3.349 time
    5: 3.079 time
    6: 3.556 time
    7: 4.183 time
    8: 4.902 time
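
One way to read the two sets of figures, assuming every thread performs the same fixed amount of work (as in the harness above, and assuming the original test was timed the same way): with n threads the total work is n times larger, so throughput relative to a single thread is roughly n divided by the printed factor. The class and method names below are illustrative only and are not part of the original benchmark.

```java
// Hypothetical helper for interpreting the printed factors.
class ScalingMath {
    /** Work completed per unit time, relative to the single-thread run. */
    static double relativeThroughput(int threads, double timeFactor) {
        return threads / timeFactor;
    }

    public static void main(String[] args) {
        int[] threads = {2, 4, 8};
        double[] reducedWrites = {1.093, 1.244, 2.412}; // factors from the modified test
        double[] original = {1.092, 3.349, 4.902};      // factors from the original test
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%d threads: %.2fx (modified) vs %.2fx (original)%n",
                    threads[i],
                    relativeThroughput(threads[i], reducedWrites[i]),
                    relativeThroughput(threads[i], original[i]));
        }
    }
}
```

For four threads this works out to about 3.2 times the single-thread throughput for the modified test, against about 1.2 times for the original.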

A common resource to over-utilise is the L3 cache. It is shared across CPUs, and while it allows a degree of concurrency, it doesn't scale well above two CPUs. I suggest you check what your Example code is doing and make sure the threads can run independently without using any shared resources. For example, most chips have a limited number of FPUs.


The Core i5 in a Lenovo X1 Carbon is not a quad core processor. It's a two core processor with hyperthreading. When you're performing only trivial operations that do not result in frequent, long pipeline stalls, then the hyperthreading scheduler won't have much opportunity to weave other operations into the stalled pipeline and you won't see performance equivalent to four actual cores.
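
A quick way to check what the JVM believes it has to work with is `Runtime.availableProcessors()`. It reports logical processors available to the JVM rather than physical cores, so on a two-core hyperthreaded chip like the one described here it would typically report 4. The class name below is illustrative.

```java
// Hypothetical helper; only Runtime.availableProcessors() is a real JDK call.
class CoreCount {
    /** Number of logical processors the JVM can schedule threads on. */
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical processors visible to the JVM: " + logicalProcessors());
    }
}
```

If this number is twice the physical core count, expecting linear speed-up all the way to that many threads is probably optimistic for compute-bound loops.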


There are several things that can limit how effectively you can multi-thread an application.

  1. Saturation of a resource such as memory/bus/etc bandwidth.

  2. Locking/contention issues (for example if threads are constantly having to wait for each other to finish).

  3. Other processes running on the system.

In your case, a volatile integer is being accessed by all of the threads, which means the threads constantly have to send the new value of that integer between themselves. This causes some level of contention and uses memory bandwidth.

Try switching each thread to be working on its own chunk of data with no volatile variable. That should reduce all forms of contention.
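
A minimal sketch of that suggestion, assuming the work can be split into independent slices of an array: each task accumulates into a local variable and hands its result back once through a `Future`, so no volatile field is shared between threads. `ChunkedSum` and `parallelSum` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; names are illustrative.
class ChunkedSum {
    static long parallelSum(int[] data, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (data.length + numThreads - 1) / numThreads; // ceiling division
            for (int t = 0; t < numThreads; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Long> task = () -> {
                    long local = 0; // private to this task: no volatile, no sharing
                    for (int i = from; i < to; i++) {
                        local += data[i];
                    }
                    return local; // published once, through the Future
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each chunk, then combine
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 10;
        }
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(data, threads));
    }
}
```

The only cross-thread communication is the single result per task, so the threads should not be fighting over the same cache line while they work.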