What mechanisms other than mutexes or garbage collection can slow my multi-threaded Java program?


I have a piece of Java code (JDK 1.6.0_22, if relevant)

There have been quite a lot of performance improvements since then. I would try Java 6 update 37 or Java 7 update 10.

It does however use a lot of memory

If so, the way you access your data matters. Accessing data in main memory can be 20x or more slower than accessing your primary cache. This means you have to access data sparingly and make the most of each piece of new data you fetch.
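To see the effect, here is a hypothetical micro-benchmark sketch (class and method names are my own invention, not from the question): it sums the same array once front to back and once in a shuffled order. On arrays much larger than the caches, the random pass is typically several times slower even though it does identical arithmetic.

```java
import java.util.Random;

public class AccessPattern {
    // Sum the array front to back: the hardware prefetcher keeps up.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Sum the same values in a shuffled order: mostly cache misses.
    static long sumRandom(int[] data, int[] order) {
        long sum = 0;
        for (int i = 0; i < order.length; i++) sum += data[order[i]];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 24; // 16M ints (~64 MB), far larger than the caches
        int[] data = new int[n];
        int[] order = new int[n];
        Random rnd = new Random(1);
        for (int i = 0; i < n; i++) {
            data[i] = i;
            order[i] = rnd.nextInt(n);
        }
        long t0 = System.nanoTime();
        long s1 = sumSequential(data);
        long seqNs = System.nanoTime() - t0;
        t0 = System.nanoTime();
        long s2 = sumRandom(data, order);
        long randNs = System.nanoTime() - t0;
        System.out.println("sequential " + seqNs + " ns, random " + randNs
                + " ns (sums " + s1 + ", " + s2 + ")");
    }
}
```

The gap you measure depends entirely on your cache sizes and memory controller, so treat the numbers as illustrative only.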

After 3 threads I can put as many threads as I want to the task, and the performance gets no better. Instead I found all the threads running, just very slowly.

This suggests you are using a resource to its maximum. Given the amount of memory you are using, the resource most likely to be maxed out is the CPU-to-main-memory bridge. I suspect you have one bridge for 64 threads! This means you should consider approaches which might use more CPU but improve the way you access memory (less randomly and more sequentially) and reduce the volume of data you move (use more compact types where possible). e.g. I have a "short with two decimal places" type I use instead of float, which can halve the memory used.
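As an illustration of such a compact type, here is a minimal sketch of a "short with two decimal places" fixed-point value. The class name and API are invented for this example, not the author's actual code; it stores e.g. 12.34 as the short 1234, using two bytes instead of a float's four.

```java
// Hypothetical fixed-point type: a value with two decimal places packed
// into a short. Range is -327.68 .. 327.67.
public class ShortDecimal {
    private final short scaled; // the value multiplied by 100

    public ShortDecimal(double value) {
        long s = Math.round(value * 100);
        if (s < Short.MIN_VALUE || s > Short.MAX_VALUE)
            throw new IllegalArgumentException("out of range: " + value);
        this.scaled = (short) s;
    }

    public double doubleValue() {
        return scaled / 100.0;
    }

    @Override
    public String toString() {
        return String.valueOf(doubleValue());
    }
}
```

In practice you would store the raw shorts in a `short[]` (or off-heap) and only convert at the edges, so the memory saving applies to the bulk data rather than to boxed objects.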

As you observed, when each thread updates its own private AtomicLong you get linear scalability. That won't use the CPU-to-main-memory bridge at all.
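A minimal sketch of that experiment (class, method names, and thread/iteration counts are my own): one shared AtomicLong that every thread CASes on, versus one private AtomicLong per thread. Both produce the same total, but the shared version bounces the counter's cache line between cores while the private version stays core-local.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterScaling {
    static final int THREADS = 4;
    static final int ITERATIONS = 1_000_000;

    // Every thread CASes the same counter: the cache line ping-pongs.
    static long runShared() {
        final AtomicLong shared = new AtomicLong();
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERATIONS; j++) shared.incrementAndGet();
            });
            ts[i].start();
        }
        joinAll(ts);
        return shared.get();
    }

    // Each thread owns its counter: no contention, linear scaling.
    static long runPrivate() {
        AtomicLong[] counters = new AtomicLong[THREADS];
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final AtomicLong own = counters[i] = new AtomicLong();
            ts[i] = new Thread(() -> {
                for (int j = 0; j < ITERATIONS; j++) own.incrementAndGet();
            });
            ts[i].start();
        }
        joinAll(ts);
        long total = 0;
        for (AtomicLong c : counters) total += c.get();
        return total;
    }

    static void joinAll(Thread[] ts) {
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) {
                throw new AssertionError(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("shared:  " + runShared());
        System.out.println("private: " + runPrivate());
    }
}
```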


From @Marko

Peter, do you have an idea how these multicore architectures work with memory, anyway?

Not as much as I would like as this issue is not visible to Java.

Does each core have an independent channel?

Each core has an independent channel to its primary cache. For the outer cache there might be a channel for each of 2-6 cache areas, but you will see a high number of collisions under heavy load.

For the bridge to main memory, there is one very wide channel. This favours long sequential accesses but is very poor for random accesses. A single thread can max it out with random reads (random enough that they don't fit in the outer cache).

Or at least independent as long as there are no collisions?

Once you exhaust the primary cache (L1 typically 32 KB) it's collisions all the way.

Because otherwise scaling is a great issue.

As the OP demonstrates. Most applications either a) spend a significant portion of their time waiting for IO, or b) do a lot of computation on small batches of data. Doing a lot of computation across large amounts of data is the worst-case scenario.

The way I deal with this is to arrange my data structures in memory for sequential access. I use off-heap memory, which is a pain, but gives you full control of the layout. (My source data is memory-mapped for persistence.) I stream the data in with sequential accesses and try to make the most of that data (i.e. I minimise repeated accesses). Even then, with 16 cores it is hard to assume all of them will be used efficiently, as I have 40 GB of source data I am working on at any one time and about 80 GB of derived data.
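A minimal sketch of that approach, assuming an invented 16-byte record layout (an 8-byte key plus an 8-byte value; none of these names come from the author's code): write records into a memory-mapped file, then stream them back in a single sequential pass, touching each record exactly once.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OffHeapScan {
    static final int RECORD_SIZE = 16; // hypothetical: 8-byte key + 8-byte value

    // Write `records` fixed-size records into a memory-mapped temp file,
    // then read them back in one sequential pass and sum the values.
    static double writeAndScan(int records) {
        try {
            Path file = Files.createTempFile("records", ".bin");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE,
                        0, (long) records * RECORD_SIZE);
                for (int i = 0; i < records; i++) {
                    buf.putLong(i);         // key
                    buf.putDouble(i * 0.5); // value
                }
                buf.rewind();
                double total = 0;
                for (int i = 0; i < records; i++) {
                    buf.getLong();          // key, skipped in this pass
                    total += buf.getDouble();
                }
                return total;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("total = " + writeAndScan(1_000));
    }
}
```

The point of the fixed record size is that a scan walks the mapping in address order, which is exactly the access pattern the wide main-memory channel is good at.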

Note: High-end GPUs address this problem by having incredibly high memory bandwidth. The top-end processors can reach 250 GB/second, whereas a typical CPU manages about 4-6 GB/second. Even so, they are better suited to vectorised processing, and their quoted peak performance is likely to involve little memory access, e.g. Mandelbrot sets.

http://www.nvidia.com/object/tesla-servers.html


Well, many experiments later, I discovered that the JVM makes no difference, but I also discovered the power of JDump. 50 of the 64 threads were at the following line.

    java.lang.Thread.State: RUNNABLE
        at java.util.Random.next(Random.java:189)
        at java.util.Random.nextInt(Random.java:239)
        at sun.misc.Hashing.randomHashSeed(Hashing.java:254)
        at java.util.HashMap.<init>(HashMap.java:255)
        at java.util.HashMap.<init>(HashMap.java:297)

Random.next looks like this

    protected int next(int bits) {
        long oldseed, nextseed;
        AtomicLong seed = this.seed;
        do {
            oldseed = seed.get();
            nextseed = (oldseed * multiplier + addend) & mask;
        } while (!seed.compareAndSet(oldseed, nextseed));
        return (int)(nextseed >>> (48 - bits));
    }

Most interesting is that this isn't an obvious lock: the threads spin on compareAndSet rather than blocking on a monitor, so the tools I was using to spot mutexes weren't working.

So it looks as though any creation of Java HashMaps causes applications to stop being scalable (I exaggerate, but not much). My application makes heavy use of HashMaps, so I guess I either rewrite HashMap or rewrite the application.
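One possible middle ground, sketched here with invented names: rather than constructing a fresh HashMap per task (each construction hitting the shared Random behind randomHashSeed), keep one map per thread and clear() it between uses. clear() touches no shared state, so the contended construction happens only once per thread.

```java
import java.util.HashMap;
import java.util.Map;

public class ReusableMaps {
    // One scratch map per thread. The HashMap constructor (and its
    // contended random seed call) runs only once per thread.
    private static final ThreadLocal<Map<String, Long>> SCRATCH =
            new ThreadLocal<Map<String, Long>>() {
                @Override
                protected Map<String, Long> initialValue() {
                    return new HashMap<String, Long>();
                }
            };

    // Returns this thread's map, emptied and ready for reuse.
    static Map<String, Long> scratchMap() {
        Map<String, Long> m = SCRATCH.get();
        m.clear();
        return m;
    }
}
```

This only helps, of course, where the maps are genuinely short-lived scratch space; maps that escape the task still have to be allocated fresh.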

I'm raising a separate question to see how to deal with this.

Thanks for all the help


You might be running into the allocation wall; that is, your program can run no faster than objects can be allocated, and allocation is limited by memory bandwidth.