Does the restrict keyword provide significant benefits in gcc/g++?
The restrict keyword does a difference.
I've seen improvements of factor 2 and more in some situations (image processing). Most of the time the difference is not that large though. About 10%.
Here is a little example that illustrate the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code and restrict wouldn't make a difference due to inlining.
I could have moved the transform function to a different file as well.
#include <math.h>#ifdef USE_RESTRICT#else#define __restrict#endifvoid transform (float * __restrict dest, float * __restrict src, float * __restrict matrix, int n) __attribute__ ((noinline));void transform (float * __restrict dest, float * __restrict src, float * __restrict matrix, int n){ int i; // simple transform loop. // written with aliasing in mind. dest, src and matrix // are potentially aliasing, so the compiler is forced to reload // the values of matrix and src for each iteration. for (i=0; i<n; i++) { dest[0] = src[0] * matrix[0] + src[1] * matrix[1] + src[2] * matrix[2] + src[3] * matrix[3]; dest[1] = src[0] * matrix[4] + src[1] * matrix[5] + src[2] * matrix[6] + src[3] * matrix[7]; dest[2] = src[0] * matrix[8] + src[1] * matrix[9] + src[2] * matrix[10] + src[3] * matrix[11]; dest[3] = src[0] * matrix[12] + src[1] * matrix[13] + src[2] * matrix[14] + src[3] * matrix[15]; src += 4; dest += 4; }}float srcdata[4*10000];float dstdata[4*10000];int main (int argc, char**args){ int i,j; float matrix[16]; // init all source-data, so we don't get NANs for (i=0; i<16; i++) matrix[i] = 1; for (i=0; i<4*10000; i++) srcdata[i] = i; // do a bunch of tests for benchmarking. for (j=0; j<10000; j++) transform (dstdata, srcdata, matrix, 10000);}
Results: (on my 2 Ghz Core Duo)
nils@doofnase:~$ gcc -O3 test.cnils@doofnase:~$ time ./a.outreal 0m2.517suser 0m2.516ssys 0m0.004snils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.cnils@doofnase:~$ time ./a.outreal 0m2.034suser 0m2.028ssys 0m0.000s
Over the thumb 20% faster execution, on that system.
To show how much it depends on the architecture I've let the same code run on a Cortex-A8 embedded CPU (adjusted the loop count a bit cause I don't want to wait that long):
root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.croot@beagleboard:~# time ./a.outreal 0m 7.64suser 0m 7.62ssys 0m 0.00sroot@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c root@beagleboard:~# time ./a.outreal 0m 7.00suser 0m 6.98ssys 0m 0.00s
Here the difference is just 9% (same compiler btw.)
Does the restrict keyword provide significant benefits in gcc / g++ ?
It can reduce the number of instructions as shown on the example below, so use it whenever possible.
GCC 4.8 Linux x86-64 exmample
Input:
void f(int *a, int *b, int *x) { *a += *x; *b += *x;}void fr(int *restrict a, int *restrict b, int *restrict x) { *a += *x; *b += *x;}
Compile and decompile:
gcc -g -std=c99 -O0 -c main.cobjdump -S main.o
With -O0
, they are the same.
With -O3
:
void f(int *a, int *b, int *x) { *a += *x; 0: 8b 02 mov (%rdx),%eax 2: 01 07 add %eax,(%rdi) *b += *x; 4: 8b 02 mov (%rdx),%eax 6: 01 06 add %eax,(%rsi) void fr(int *restrict a, int *restrict b, int *restrict x) { *a += *x; 10: 8b 02 mov (%rdx),%eax 12: 01 07 add %eax,(%rdi) *b += *x; 14: 01 06 add %eax,(%rsi)
For the uninitiated, the calling convention is:
rdi
= first parameterrsi
= second parameterrdx
= third parameter
Conclusion: 3 instructions instead of 4.
Of course, instructions can have different latencies, but this gives a good idea.
Why GCC was able to optimize that?
The code above was taken from the Wikipedia example which is very illuminating.
Pseudo assembly for f
:
load R1 ← *x ; Load the value of x pointerload R2 ← *a ; Load the value of a pointeradd R2 += R1 ; Perform Additionset R2 → *a ; Update the value of a pointer; Similarly for b, note that x is loaded twice,; because x may point to a (a aliased by x) thus ; the value of x will change when the value of a; changes.load R1 ← *xload R2 ← *badd R2 += R1set R2 → *b
For fr
:
load R1 ← *xload R2 ← *aadd R2 += R1set R2 → *a; Note that x is not reloaded,; because the compiler knows it is unchanged; "load R1 ← *x" is no longer needed.load R2 ← *badd R2 += R1set R2 → *b
Is it really any faster?
Ermmm... not for this simple test:
.text .global _start _start: mov $0x10000000, %rbx mov $x, %rdx mov $x, %rdi mov $x, %rsi loop: # START of interesting block mov (%rdx),%eax add %eax,(%rdi) mov (%rdx),%eax # Comment out this line. add %eax,(%rsi) # END ------------------------ dec %rbx cmp $0, %rbx jnz loop mov $60, %rax mov $0, %rdi syscall.data x: .int 0
And then:
as -o a.o a.S && ld a.o && time ./a.out
on Ubuntu 14.04 AMD64 CPU Intel i5-3210M.
I confess that I still don't understand modern CPUs. Let me know if you:
- found a flaw in my method
- found an assembler test case where it becomes much faster
- understand why there wasn't a difference
The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf) which says it generally doesn't help and provides measurements to back this up.