Multi-threaded random_r is slower than single threaded version
A very simple change to space the data out in memory:
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS * 64, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
        initstate_r(random(), &rand_statebufs[t * 64], PRNG_BUFSZ, &rand_states[t * 64]);
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t * 64]);
    }
results in a much faster running time on my dual-core machine.
This would confirm the suspicion the change was meant to test: that you were mutating values on the same cache line from two separate threads, and so suffering cache contention (false sharing). Herb Sutter's talk 'Machine Architecture: Things Your Programming Language Never Told You' is worth watching if you don't know about this yet; he demonstrates false sharing starting at around 1:20.
Work out your cache line size, and create each thread's data so it is aligned to it.
It's a bit cleaner to plonk all the thread's data into a struct, and align that:
    #define CACHE_LINE_SIZE 64

    struct thread_data {
        struct random_data random_data;
        char statebuf[PRNG_BUFSZ];
        char padding[CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
    };

    int main(int argc, char** argv)
    {
        printf("%zd\n", sizeof(struct thread_data));
        void* apointer;
        if (posix_memalign(&apointer, sizeof(struct thread_data), NTHREADS * sizeof(struct thread_data)))
            exit(1);
        struct thread_data* thread_states = apointer;
        memset(apointer, 0, NTHREADS * sizeof(struct thread_data));
        pthread_t* thread_ids;
        int t = 0;
        thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
        /* create threads */
        for (t = 0; t < NTHREADS; t++) {
            initstate_r(random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data);
            pthread_create(&thread_ids[t], NULL, &thread_run, &thread_states[t].random_data);
        }
        for (t = 0; t < NTHREADS; t++) {
            pthread_join(thread_ids[t], NULL);
        }
        free(thread_ids);
        free(thread_states);
    }
with CACHE_LINE_SIZE 64:
    refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
    refugio:$ time bin/nixuz_random_r
    64
    63499495944240966

    real    0m1.278s
    user    0m2.540s
    sys     0m0.000s
Or you can use double the cache line size and plain malloc - the extra padding ensures the mutated memory ends up on separate lines, since malloc is (IIRC) only 16-byte rather than 64-byte aligned.
(I reduced ITERATIONS by a factor of ten rather than having a stupidly fast machine)
I don't know if this is relevant or not, but I just saw very similar behaviour (an order of magnitude slower with 2 threads than with one). I basically changed:
srand(seed); foo = rand();
to:
myseed = seed; foo = rand_r(&myseed);
and that "fixed" it (2 threads is now reliably almost twice as fast - e.g. 19s instead of 35s).
I don't know what the issue could have been - locking or cache coherence on the internals of rand(), maybe? Anyway, there is also a random_r(), so maybe that would be of use to you (a year ago) or someone else.