Multi-threaded C program much slower in OS X than Linux


Mac OS X and Linux implement pthread mutexes differently, causing this slow behavior. Specifically, Mac OS X does not spin before putting a waiting thread to sleep (spinning is an optional implementation strategy; the POSIX standard does not require it). This can lead to very, very slow performance in heavily contended examples like this one.


I've duplicated your result to a goodly extent (without the sweeper):

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

pthread_mutex_t Lock;
pthread_t       LastThread;
int             Array[100];

void *foo(void *arg)
{
  pthread_t self  = pthread_self();
  int num_in_row  = 1;
  int num_streaks = 0;
  double avg_strk = 0.0;
  int i;

  for (i = 0; i < 1000000; ++i)
  {
    int p1 = (int) (100.0 * rand() / (RAND_MAX - 1));
    int p2 = (int) (100.0 * rand() / (RAND_MAX - 1));

    pthread_mutex_lock(&Lock);
    {
      int tmp   = Array[p1];
      Array[p1] = Array[p2];
      Array[p2] = tmp;

      if (pthread_equal(LastThread, self))
        ++num_in_row;
      else
      {
        ++num_streaks;
        avg_strk += (num_in_row - avg_strk) / num_streaks;
        num_in_row = 1;
        LastThread = self;
      }
    }
    pthread_mutex_unlock(&Lock);
  }

  fprintf(stdout, "Thread exiting with avg streak length %lf\n", avg_strk);
  return NULL;
}

int main(int argc, char **argv)
{
  int       num_threads = (argc > 1 ? atoi(argv[1]) : 40);
  pthread_t thrs[num_threads];
  void     *ret;
  int       i;

  if (pthread_mutex_init(&Lock, NULL))
  {
    perror("pthread_mutex_init failed!");
    return 1;
  }

  for (i = 0; i < 100; ++i)
    Array[i] = i;

  for (i = 0; i < num_threads; ++i)
    if (pthread_create(&thrs[i], NULL, foo, NULL))
    {
      perror("pthread create failed!");
      return 1;
    }

  for (i = 0; i < num_threads; ++i)
    if (pthread_join(thrs[i], &ret))
    {
      perror("pthread join failed!");
      return 1;
    }

  /*
  for (i = 0; i < 100; ++i)
    printf("%d\n", Array[i]);
  printf("Goodbye!\n");
  */

  return 0;
}

On a Linux server (kernel 2.6.18-308.24.1.el5) with an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz:

[ltn@svg-dc60-t1 ~]$ time ./a.out 1

real    0m0.068s
user    0m0.068s
sys     0m0.001s

[ltn@svg-dc60-t1 ~]$ time ./a.out 2

real    0m0.378s
user    0m0.443s
sys     0m0.135s

[ltn@svg-dc60-t1 ~]$ time ./a.out 3

real    0m0.899s
user    0m0.956s
sys     0m0.941s

[ltn@svg-dc60-t1 ~]$ time ./a.out 4

real    0m1.472s
user    0m1.472s
sys     0m2.686s

[ltn@svg-dc60-t1 ~]$ time ./a.out 5

real    0m1.720s
user    0m1.660s
sys     0m4.591s

[ltn@svg-dc60-t1 ~]$ time ./a.out 40

real    0m11.245s
user    0m13.716s
sys     1m14.896s

On my MacBook Pro (Yosemite 10.10.2), 2.6 GHz i7, 16 GB memory:

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 1

real    0m0.057s
user    0m0.054s
sys     0m0.002s

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 2

real    0m5.684s
user    0m1.148s
sys     0m5.353s

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 3

real    0m8.946s
user    0m1.967s
sys     0m8.034s

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 4

real    0m11.980s
user    0m2.274s
sys     0m10.801s

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 5

real    0m15.680s
user    0m3.307s
sys     0m14.158s

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 40

real    2m7.377s
user    0m23.926s
sys     2m2.434s

It took my Mac ~12x as much wall clock time to complete with 40 threads, and that's versus a very old version of Linux + gcc.

NOTE: I changed my code to do 1M swaps per thread.

It looks like under contention OS X is doing a lot more work than Linux. Maybe it is interleaving the threads much more finely than Linux does?

EDIT: Updated the code to record the average number of times a thread re-captures the lock immediately:

Linux

[ltn@svg-dc60-t1 ~]$ time ./a.out 10
Thread exiting with avg streak length 2.103567
Thread exiting with avg streak length 2.156641
Thread exiting with avg streak length 2.101194
Thread exiting with avg streak length 2.068383
Thread exiting with avg streak length 2.110132
Thread exiting with avg streak length 2.046878
Thread exiting with avg streak length 2.087338
Thread exiting with avg streak length 2.049701
Thread exiting with avg streak length 2.041052
Thread exiting with avg streak length 2.048456

real    0m2.837s
user    0m3.012s
sys     0m16.040s

Mac OSX

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 10
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000

real    0m34.163s
user    0m5.902s
sys     0m30.329s

So, OS X is sharing its locks much more evenly and therefore has many more thread suspensions and resumptions.


The OP does not mention or show any code indicating that the threads sleep, wait, or give up execution, and all the threads run at the same 'nice' level.

So an individual thread may well get the CPU and not release it until it has completed all 2 million executions.

On Linux, this would result in a minimal amount of time spent performing context switches.

However, on Mac OS, a thread is only given a 'time slice' to execute before another 'ready to execute' thread/process is allowed to run.

This means many, many more context switches.

Context switches are accounted for in 'sys' time.

The result is that Mac OS will take much longer to execute.

To even the playing field, you could force context switches by inserting a nanosleep() or by releasing execution via

#include <sched.h>

then calling

int sched_yield(void);