Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine? multithreading


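Busy waiting can be a problem, because this loop keeps a core spinning while the workers are still running:

while (!executor.isTerminated()) { }

You can block in awaitTermination() instead (call it after executor.shutdown(), and note that it can throw InterruptedException):

while (!executor.awaitTermination(1, TimeUnit.SECONDS)) { }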
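
For context, here is a minimal, self-contained sketch of that pattern. The class name, the method name and the trivial counter task are hypothetical stand-ins, not taken from the original Collatz.java; the only point is the order of shutdown() and awaitTermination().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the task body stands in for the real work units.
final class AwaitTerminationSketch {

    // Runs `tasks` trivial jobs on `threads` workers and returns how many completed.
    static int runAll(int threads, int tasks) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        final AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> completed.incrementAndGet()); // placeholder work
        }
        executor.shutdown(); // stop accepting new work; queued tasks still run
        try {
            // Blocks the calling thread instead of spinning on isTerminated().
            while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {
                // timed out after one second; wait again
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            executor.shutdownNow();             // and stop waiting
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runAll(4, 16));
    }
}
```

The calling thread sleeps inside awaitTermination(), so all cores stay available to the worker threads.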


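You are using BigInteger, which is immutable: every arithmetic operation allocates a new object on the heap. That allocation and garbage-collection traffic most likely makes your process memory-bound rather than CPU-bound, so extra cores help less than you would expect.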

Also note that your timings include the extra time the JVM takes to create the threads and manage the thread pool.
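
One way to see how much of the measured time is pool setup is to time it separately. This is only a rough sketch under my own assumptions: the class, the method and the summing task are hypothetical, and it builds a ThreadPoolExecutor directly so that prestartAllCoreThreads() can create the workers before the work phase is timed.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class: separates pool setup time from work time.
final class WarmPoolTimingSketch {

    // Returns { setupNanos, workNanos, checksum } for `tasks` trivial jobs.
    static long[] timeRun(int threads, int tasks) {
        long t0 = System.nanoTime();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        pool.prestartAllCoreThreads(); // create the worker threads up front
        long t1 = System.nanoTime();   // setup phase ends here

        final AtomicLong sum = new AtomicLong();
        for (int i = 1; i <= tasks; i++) {
            final long n = i;
            pool.execute(() -> sum.addAndGet(n)); // placeholder for a real work unit
        }
        pool.shutdown();
        try {
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                // keep waiting until every queued task has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        long t2 = System.nanoTime();   // work phase ends here
        return new long[] { t1 - t0, t2 - t1, sum.get() };
    }

    public static void main(String[] args) {
        long[] r = timeRun(4, 1000);
        System.out.println("setup ns: " + r[0] + ", work ns: " + r[1] + ", checksum: " + r[2]);
    }
}
```

For a run that lasts tens of seconds the setup share is probably tiny, but measuring it removes the guesswork.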

You could also have memory contention if you are using constant Strings. String literals are interned in a shared string pool, which may become a bottleneck unless Java is really clever about it.

Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.


As @axtavt answered, busy waiting can be a problem. Fix that first: it is part of the answer, but not all of it. It won't appear to help in your case (on the Q6600), because that machine seems to be bottlenecked at 2 cores for some reason, which leaves a spare core for the busy loop and hides the slowdown. On my Core i5, though, removing it speeds up the 4-thread version noticeably.

I suspect that on the Q6600 your particular app is limited by the amount of shared cache available, or by something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, each shared by a pair of cores, and no L3 cache. On my Core i5, each core has a dedicated 256K L2 cache, and there is a larger 8MB shared L3 cache. The 256K of dedicated per-core cache might make the difference; otherwise something else in the architecture does.

Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.

On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21* (64-bit), here are some basic results:

  • 10000000 500000 1 (avg of three runs): 36982 ms
  • 10000000 500000 4 (avg of three runs): 21252 ms

Faster, certainly, but not finishing in a quarter of the time as you would expect, or even half (it takes a bit more than half; more on that in a moment). Note that in my case I halved the size of the work units, and I have a default max heap of 1500m.

At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):

  • 10000000 500000 1 (avg of 3 runs) 32677 ms
  • 10000000 500000 4 (avg of 3 runs) 8825 ms
  • 10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)

With the busy-wait loop removed, the 4-thread version takes 27% of the time the 1-thread version takes. Much better. Clearly the code can make efficient use of 4 cores...
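
To put the two machines side by side, the timings above can be turned into a speedup (single-thread time divided by multi-thread time) and a parallel efficiency (speedup divided by thread count). The helper class below is a hypothetical sketch; only the millisecond figures come from the runs listed above. It works out to roughly 1.74x on the Q6600 and 3.70x on the i5 with the busy-wait fix.

```java
import java.util.Locale;

// Hypothetical helper for comparing the measured runs.
final class SpeedupSketch {

    // Speedup = single-thread time / multi-thread time.
    static double speedup(long singleMs, long multiMs) {
        return (double) singleMs / multiMs;
    }

    // Parallel efficiency = speedup / number of threads (1.0 would be ideal scaling).
    static double efficiency(long singleMs, long multiMs, int threads) {
        return speedup(singleMs, multiMs) / threads;
    }

    static void report(String label, long singleMs, long multiMs, int threads) {
        System.out.printf(Locale.ROOT, "%s: speedup %.2fx, efficiency %.0f%%%n",
                label,
                speedup(singleMs, multiMs),
                100.0 * efficiency(singleMs, multiMs, threads));
    }

    public static void main(String[] args) {
        // Timings in ms, taken from the averaged runs quoted above.
        report("Q6600, 4 threads", 36982, 21252, 4);
        report("i5 750, 4 threads (busy wait removed)", 32677, 8825, 4);
        report("i5 750, 4 threads (busy wait kept)", 32677, 11475, 4);
    }
}
```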

  • NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.

You may want to increase your default heap, just in case garbage collection is happening and slowing your 4-threaded version down a bit. It might help, it might not.
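
If you are not sure what your default maximum heap actually is, a tiny check like the one below prints it. The class name is hypothetical. The limit itself is raised with the standard -Xmx option, for example "java -Xmx2g Collatz 10000000 500000 4", where 2g is just an arbitrary example value, and -verbose:gc will show whether collections are happening during the run.

```java
// Hypothetical helper: reports the maximum heap the running JVM will use.
final class HeapInfoSketch {

    // Maximum heap in MiB, as reported by the running JVM.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
    }
}
```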

In your example, there's a chance your larger work unit size is skewing your results slightly. Halving it may help you get closer to at least 2x the speed, since all 4 threads will be kept busy for a longer portion of the run. I don't think the Q6600 will do much better at this particular task, whether the cause is cache or some other inherent architectural limit.

In all cases, I am simply running "java Collatz 10000000 500000 X", where X = the number of threads indicated.

The only changes I made to your Java file were to turn one of the println calls into a print, so that my runs with 500000 per work unit produced fewer line breaks and I could see more results in my console at once, and to ditch the busy-wait loop, which matters on the i5 750 but made no difference on the Q6600.