
Multi-threaded random_r is slower than the single-threaded version


A very simple change to space the data out in memory:

/* over-allocate and stride each thread's data 64 elements apart, so no
   two threads' mutated state shares a cache line */
struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS * 64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    initstate_r(random(), &rand_statebufs[t * 64], PRNG_BUFSZ, &rand_states[t * 64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t * 64]);
}

results in a much faster running time on my dual-core machine.

This would confirm the suspicion it was meant to test: you are mutating values on the same cache line from two separate threads, and so suffering cache contention (false sharing). If that's new to you, Herb Sutter's talk 'Machine Architecture: Things Your Programming Language Never Told You' is worth watching if you've got the time; he demonstrates false sharing starting at around 1:20.
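If you haven't seen false sharing in isolation before, it's easy to reproduce with a toy program (illustrative code only, not part of the question's program): time it as written, then delete the padding so both counters land on one line and time it again.

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000UL
#define CACHE_LINE 64   /* assumed line size; check your machine's */

/* each counter gets its own cache line; remove the padding to force
   both counters onto the same line */
struct counter {
    volatile unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};

static struct counter counters[2];

static void* bump(void* arg)
{
    struct counter* c = arg;
    unsigned long i;
    for (i = 0; i < ITERATIONS; i++)
        c->value++;   /* a stream of writes to one cache line */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int i;
    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, bump, &counters[i]);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%lu %lu\n", counters[0].value, counters[1].value);
    return 0;
}

With the padding, each core keeps its own line in cache; without it, every increment invalidates the other core's copy of the line.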

Work out your cache line size, and create each thread's data so it is aligned to it.
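On Linux you don't have to guess; you can query it at runtime. This is a small sketch assuming glibc, where _SC_LEVEL1_DCACHE_LINESIZE is an extension that may report 0 on some systems, hence the fallback (getconf LEVEL1_DCACHE_LINESIZE gives the same answer from the shell):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;   /* a common value on current x86 */
    printf("cache line size: %ld bytes\n", line);
    return 0;
}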

It's a bit cleaner to plonk all of each thread's data into a struct, and align that:

#define CACHE_LINE_SIZE 64

struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    /* pad each thread's data out to a full cache line */
    char padding[CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
};

int main(int argc, char** argv)
{
    printf("%zu\n", sizeof(struct thread_data));
    void* apointer;
    if (posix_memalign(&apointer, sizeof(struct thread_data), NTHREADS * sizeof(struct thread_data)))
        exit(1);
    struct thread_data* thread_states = apointer;
    memset(apointer, 0, NTHREADS * sizeof(struct thread_data));
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
        initstate_r(random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data);
        pthread_create(&thread_ids[t], NULL, &thread_run, &thread_states[t].random_data);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(thread_states);
}

with CACHE_LINE_SIZE 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r
64
63499495944240966
real    0m1.278s
user    0m2.540s
sys     0m0.000s

Or you can pad to double the cache line size and just use malloc (sketched below): the extra padding ensures the mutated memory lands on separate lines, since malloc is only 16-byte (IIRC) rather than 64-byte aligned.

(I reduced ITERATIONS by a factor of ten rather than having a stupidly fast machine)
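The malloc variant would look roughly like this (a sketch only; it reuses NTHREADS, PRNG_BUFSZ and thread_run from the question's code, and CACHE_LINE_SIZE from above):

#define SPACING (2 * CACHE_LINE_SIZE)   /* two full lines per thread */

char* states = malloc(NTHREADS * SPACING);
char* bufs = malloc(NTHREADS * SPACING);
pthread_t* thread_ids = calloc(NTHREADS, sizeof(pthread_t));
int t;

for (t = 0; t < NTHREADS; t++) {
    /* malloc only guarantees 16-byte alignment, but a two-line stride
       still keeps each thread's mutated state off every other thread's
       cache lines, wherever the block happens to start */
    struct random_data* rd = (struct random_data*)(states + t * SPACING);
    memset(rd, 0, sizeof *rd);   /* random_data must be zeroed before initstate_r */
    initstate_r(random(), bufs + t * SPACING, PRNG_BUFSZ, rd);
    pthread_create(&thread_ids[t], NULL, &thread_run, rd);
}

Using calloc instead of malloc would do the zeroing for you, as in the first snippet.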


I don't know if this is relevant or not, but I just saw very similar behavior (an order of magnitude slower with 2 threads than with one)... I basically changed a:

  srand(seed);
  foo = rand();

to a

  myseed = seed;
  foo = rand_r(&myseed);

and that "fixed" it (2 threads are now reliably almost twice as fast, e.g. 19s instead of 35s).

I don't know what the issue could have been -- locking or cache coherence on the internals of rand() maybe? Anyway, there is also a random_r() so maybe that would be of use to you (a year ago) or someone else.
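For later readers, a complete version of that rand_r pattern might look like this (a minimal sketch, not the poster's actual program; NTHREADS and the iteration count are made up for illustration). One plausible explanation for the slowdown is that glibc's rand() guards its shared state with an internal lock, so threads serialize on it, while rand_r touches only the caller's seed:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 2
#define ITERATIONS 10000000L

static void* worker(void* arg)
{
    /* each thread owns its seed, so rand_r shares nothing with other threads */
    unsigned int myseed = (unsigned int)(uintptr_t)arg;
    unsigned long sum = 0;
    long i;
    for (i = 0; i < ITERATIONS; i++)
        sum += (unsigned long)rand_r(&myseed);
    printf("%lu\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void*)(uintptr_t)(i + 1));
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

Because myseed lives on each thread's own stack, the seeds also can't land on a shared cache line, which sidesteps the false-sharing problem from the answer above.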