Does the restrict keyword provide significant benefits in gcc/g++? Does the restrict keyword provide significant benefits in gcc/g++? c c

Does the restrict keyword provide significant benefits in gcc/g++?


The restrict keyword does a difference.

I've seen improvements of factor 2 and more in some situations (image processing). Most of the time the difference is not that large though. About 10%.

Here is a little example that illustrate the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code and restrict wouldn't make a difference due to inlining.

I could have moved the transform function to a different file as well.

#include <math.h>#ifdef USE_RESTRICT#else#define __restrict#endifvoid transform (float * __restrict dest, float * __restrict src,                 float * __restrict matrix, int n) __attribute__ ((noinline));void transform (float * __restrict dest, float * __restrict src,                 float * __restrict matrix, int n){  int i;  // simple transform loop.  // written with aliasing in mind. dest, src and matrix   // are potentially aliasing, so the compiler is forced to reload  // the values of matrix and src for each iteration.  for (i=0; i<n; i++)  {    dest[0] = src[0] * matrix[0] + src[1] * matrix[1] +               src[2] * matrix[2] + src[3] * matrix[3];    dest[1] = src[0] * matrix[4] + src[1] * matrix[5] +               src[2] * matrix[6] + src[3] * matrix[7];    dest[2] = src[0] * matrix[8] + src[1] * matrix[9] +               src[2] * matrix[10] + src[3] * matrix[11];    dest[3] = src[0] * matrix[12] + src[1] * matrix[13] +               src[2] * matrix[14] + src[3] * matrix[15];    src  += 4;    dest += 4;  }}float srcdata[4*10000];float dstdata[4*10000];int main (int argc, char**args){  int i,j;  float matrix[16];  // init all source-data, so we don't get NANs    for (i=0; i<16; i++)   matrix[i] = 1;  for (i=0; i<4*10000; i++) srcdata[i] = i;  // do a bunch of tests for benchmarking.   for (j=0; j<10000; j++)    transform (dstdata, srcdata, matrix, 10000);}

Results: (on my 2 Ghz Core Duo)

nils@doofnase:~$ gcc -O3 test.cnils@doofnase:~$ time ./a.outreal    0m2.517suser    0m2.516ssys     0m0.004snils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.cnils@doofnase:~$ time ./a.outreal    0m2.034suser    0m2.028ssys     0m0.000s

Over the thumb 20% faster execution, on that system.

To show how much it depends on the architecture I've let the same code run on a Cortex-A8 embedded CPU (adjusted the loop count a bit cause I don't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.croot@beagleboard:~# time ./a.outreal    0m 7.64suser    0m 7.62ssys     0m 0.00sroot@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c root@beagleboard:~# time ./a.outreal    0m 7.00suser    0m 6.98ssys     0m 0.00s

Here the difference is just 9% (same compiler btw.)


Does the restrict keyword provide significant benefits in gcc / g++ ?

It can reduce the number of instructions as shown on the example below, so use it whenever possible.

GCC 4.8 Linux x86-64 exmample

Input:

void f(int *a, int *b, int *x) {  *a += *x;  *b += *x;}void fr(int *restrict a, int *restrict b, int *restrict x) {  *a += *x;  *b += *x;}

Compile and decompile:

gcc -g -std=c99 -O0 -c main.cobjdump -S main.o

With -O0, they are the same.

With -O3:

void f(int *a, int *b, int *x) {    *a += *x;   0:   8b 02                   mov    (%rdx),%eax   2:   01 07                   add    %eax,(%rdi)    *b += *x;   4:   8b 02                   mov    (%rdx),%eax   6:   01 06                   add    %eax,(%rsi)  void fr(int *restrict a, int *restrict b, int *restrict x) {    *a += *x;  10:   8b 02                   mov    (%rdx),%eax  12:   01 07                   add    %eax,(%rdi)    *b += *x;  14:   01 06                   add    %eax,(%rsi) 

For the uninitiated, the calling convention is:

  • rdi = first parameter
  • rsi = second parameter
  • rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why GCC was able to optimize that?

The code above was taken from the Wikipedia example which is very illuminating.

Pseudo assembly for f:

load R1 ← *x    ; Load the value of x pointerload R2 ← *a    ; Load the value of a pointeradd R2 += R1    ; Perform Additionset R2 → *a     ; Update the value of a pointer; Similarly for b, note that x is loaded twice,; because x may point to a (a aliased by x) thus ; the value of x will change when the value of a; changes.load R1 ← *xload R2 ← *badd R2 += R1set R2 → *b

For fr:

load R1 ← *xload R2 ← *aadd R2 += R1set R2 → *a; Note that x is not reloaded,; because the compiler knows it is unchanged; "load R1 ← *x" is no longer needed.load R2 ← *badd R2 += R1set R2 → *b

Is it really any faster?

Ermmm... not for this simple test:

.text    .global _start    _start:        mov $0x10000000, %rbx        mov $x, %rdx        mov $x, %rdi        mov $x, %rsi    loop:        # START of interesting block        mov (%rdx),%eax        add %eax,(%rdi)        mov (%rdx),%eax # Comment out this line.        add %eax,(%rsi)        # END ------------------------        dec %rbx        cmp $0, %rbx        jnz loop        mov $60, %rax        mov $0, %rdi        syscall.data    x:        .int 0

And then:

as -o a.o a.S && ld a.o && time ./a.out

on Ubuntu 14.04 AMD64 CPU Intel i5-3210M.

I confess that I still don't understand modern CPUs. Let me know if you:

  • found a flaw in my method
  • found an assembler test case where it becomes much faster
  • understand why there wasn't a difference


The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf) which says it generally doesn't help and provides measurements to back this up.