how can you measure the time spent in a context switch under the Java platform?



You can't easily separate the waste due to thread switching from the waste due to memory-cache contention. You CAN measure the thread contention: on Linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than, say, 30 context switches per second no matter how many threads you use, and that time is going to be relatively small compared to the amount of work you're doing. The real cost of context switching is cache pollution: there is a high probability that you'll take mostly cache misses once you're context-switched back in. Thus OS time and context-switch counts are of minimal value.
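For what it's worth, a Java process on Linux can read its own counters rather than shelling out: the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields of /proc/self/status are standard there, and per-thread figures live under /proc/self/task/<tid>/status. A minimal sketch (Linux only):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CtxSwitchStats {
    public static void main(String[] args) throws IOException {
        // Linux-only: /proc/self/status exposes voluntary_ctxt_switches and
        // nonvoluntary_ctxt_switches for the whole process; per-thread counts
        // live under /proc/self/task/<tid>/status.
        try (Stream<String> lines = Files.lines(Paths.get("/proc/self/status"))) {
            lines.filter(l -> l.contains("ctxt_switches"))
                 .forEach(System.out::println);
        }
    }
}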

What's REALLY valuable is the rate of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a plain cache miss, because you have to force the peer CPU to write its value back to main memory before you can even start reading. Some CPUs let you pull from peer cache lines without hitting main memory.
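A rough way to see this effect from Java (assuming a typical 64-byte cache line; the JVM may reorder fields, so the manual padding here is best-effort): two threads increment adjacent longs, then increment longs separated by padding. On most multi-core machines the padded version is markedly faster:

// Two threads hammer adjacent longs (likely the same cache line) vs. padded longs.
// Timings are machine-dependent; only the relative difference matters.
public class FalseSharingDemo {
    static class Shared {
        volatile long a;                 // a and b likely share one cache line
        volatile long b;
    }
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // crude padding to push b onto another line
        volatile long b;
    }

    static long time(Runnable r1, Runnable r2) throws InterruptedException {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 50_000_000;
        Shared s = new Shared();
        Padded p = new Padded();
        long shared = time(() -> { for (int i = 0; i < N; i++) s.a++; },
                           () -> { for (int i = 0; i < N; i++) s.b++; });
        long padded = time(() -> { for (int i = 0; i < N; i++) p.a++; },
                           () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.println("adjacent fields: " + shared / 1_000_000 + " ms");
        System.out.println("padded fields:   " + padded / 1_000_000 + " ms");
    }
}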

So the key is to absolutely minimize ANY shared, modified memory structures. Make everything as read-only as possible. This INCLUDES shared FIFO buffers (including Executor pools). Namely, if you use a synchronized queue, then every sync op is a shared dirty memory region. Moreover, if the rate is high enough, it'll likely trigger an OS trap to stall, waiting for peer threads' mutexes.

The ideal is to segment RAM, distribute a single large unit of work to a fixed number of workers, then use a CountDownLatch or some other memory barrier (such that each thread touches it only once). Ideally, any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty reads. I haven't determined whether Java 5 Lock objects avoid this, but you're still leveraging OS stalls, which won't help your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer).
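A minimal sketch of that pattern, with illustrative names and sizes: each worker reads and writes only its own segment of one large array, writes its result slot exactly once, and the CountDownLatch is the only shared synchronization point:

import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

public class PartitionedSum {
    public static void main(String[] args) throws InterruptedException {
        final int workers = Runtime.getRuntime().availableProcessors();
        final double[] data = new double[1 << 24];
        Arrays.fill(data, 1.0);
        // One result slot per worker. Adjacent slots can still share a cache
        // line, but each is written exactly once at the end, so the window
        // for contention is tiny (pad the slots if you want to eliminate it).
        final double[] partial = new double[workers];
        final CountDownLatch done = new CountDownLatch(workers);

        int chunk = data.length / workers;
        for (int w = 0; w < workers; w++) {
            final int id = w, from = w * chunk;
            final int to = (w == workers - 1) ? data.length : from + chunk;
            new Thread(() -> {
                double sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                partial[id] = sum;   // each worker dirties only its own slot
                done.countDown();    // the latch is touched once per worker
            }).start();
        }
        done.await();
        double total = 0;
        for (double s : partial) total += s;
        System.out.println("sum = " + total);
    }
}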

Generally my experience is that single-threading is faster than multithreading for a common byte array/object array, etc., at least with the simplistic sorting/filtering algorithms I've experimented with. This is true in both Java and C, in my experience. I haven't tried FPU-intensive ops (like divides and sqrt), where cache lines may be less of a factor.

Basically, on a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even between threads), but multithreading buys you less than nothing. With hyperthreading, it's the same deal. In single-CPU shared-L2/L3-cache configurations (e.g. AMD's), you might find some benefit. On multi-CPU Intel buses, forget it: shared write memory is worse than single-threading.


To measure how much time a context switch takes, I would run something like the following:

public class ContextSwitchTimer {

    static volatile long startTime;     // written by the task, read by main

    public static void main(String[] args) {
        Object theLock = new Object();
        long endTime = 0;
        synchronized (theLock) {
            Thread task = new TheTask(theLock);
            task.start();               // the task blocks on theLock until we wait()
            try {
                theLock.wait();         // releases the lock and switches to the task
                // nanoTime: millisecond resolution is too coarse for one switch
                endTime = System.nanoTime();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        System.out.println("Context switch time elapsed: " + (endTime - startTime) + " ns");
    }
}

class TheTask extends Thread {
    private final Object theLock;

    TheTask(Object theLock) {
        this.theLock = theLock;
    }

    public void run() {
        synchronized (theLock) {
            ContextSwitchTimer.startTime = System.nanoTime();
            theLock.notify();           // wake main; the switch back is what we time
        }
    }
}

You might want to run this code several times to get an average, and make sure these two threads are the only ones running on your machine (so that the context switch happens only between these two threads).


how much CPU time is used in switching threads instead of running them

  • Let's say you have 100 million FPU operations to perform.
  • Load them into a synchronized queue (i.e., threads must lock the queue to poll it).
  • Let n be the number of processors available on your machine (dual-core = 2, etc.).

Then create n threads pulling from the queue to perform all the FPU operations. You can compute the total time with System.currentTimeMillis() before and after. Then try with n+1 threads, then n+2, n+3, etc.

In theory, the more threads you have, the more switching there will be, and the more time it should take to process all the FPU operations. This gives you a very rough idea of the switching overhead, but it is hard to measure.
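A sketch of that experiment, with arbitrary task counts and work sizes; the absolute numbers matter less than the trend as the thread count grows past the core count:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class SwitchOverhead {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int threads = cores; threads <= cores + 4; threads++) {
            System.out.println(threads + " threads: " + run(threads) + " ms");
        }
    }

    static long run(int threads) throws InterruptedException {
        BlockingQueue<double[]> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100_000; i++) queue.add(new double[64]);
        AtomicLong sink = new AtomicLong();   // defeats dead-code elimination
        long start = System.currentTimeMillis();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                double[] work;
                while ((work = queue.poll()) != null) {   // non-blocking drain
                    double acc = 0;
                    for (double v : work) acc += Math.sqrt(v + 1);  // some FPU ops
                    sink.addAndGet((long) acc);
                }
            });
            pool[t].start();
        }
        for (Thread t : pool) t.join();
        return System.currentTimeMillis() - start;
    }
}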

how much synchronization traffic is created on the shared memory bus? When threads share data, they must use a synchronization mechanism.

I would create 10 threads, each sending 10,000 messages to another thread chosen at random, using a synchronized blocking queue of 100 messages. Each thread would peek at the blocking queue to check whether the head message is addressed to it, and pull it out if so. Then it would try to push a message in without blocking, repeat the peek operation, and so on until the queue is empty and all threads return.

Along the way, each thread would count the number of successful pushes and peeks/pulls versus unsuccessful ones. That would give you a rough idea of useful work versus useless work in the synchronization traffic. Again, this is hard to measure.

Of course, you could also play with the number of threads or the size of the blocking queue.
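A rough sketch of that experiment follows. Note that peek-then-poll is not atomic on a BlockingQueue, so this version guards the pair with an external lock, which is itself part of the synchronization traffic being measured; the class name and constants are illustrative:

import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionCounter {
    static final int THREADS = 10, MESSAGES = 10_000, TOTAL = THREADS * MESSAGES;
    static final ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(100);
    static final Object guard = new Object();
    static final AtomicLong delivered = new AtomicLong();
    static final AtomicLong useful = new AtomicLong(), wasted = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread[] pool = new Thread[THREADS];
        for (int id = 0; id < THREADS; id++) {
            final int me = id;
            pool[id] = new Thread(() -> {
                Random rnd = new Random();
                int sent = 0;
                // keep pulling until every message in the system is delivered,
                // so no thread exits while messages addressed to it remain
                while (sent < MESSAGES || delivered.get() < TOTAL) {
                    // non-blocking push of a message addressed to a random peer
                    if (sent < MESSAGES) {
                        if (queue.offer(rnd.nextInt(THREADS))) { sent++; useful.incrementAndGet(); }
                        else wasted.incrementAndGet();
                    }
                    // peek, and pull only if the head message is addressed to us
                    synchronized (guard) {
                        Integer head = queue.peek();
                        if (head != null && head == me) {
                            queue.poll();
                            delivered.incrementAndGet();
                            useful.incrementAndGet();
                        } else {
                            wasted.incrementAndGet();
                        }
                    }
                }
            });
            pool[id].start();
        }
        for (Thread t : pool) t.join();
        System.out.println("useful ops: " + useful + "  wasted ops: " + wasted);
    }
}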