Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C


NOTE: All algorithms below are in C, but should be portable to your language of choice (just don't look at me when they're not as fast :)

Options

Low Memory (32-bit int, 32-bit machine) (from here):

unsigned int reverse(register unsigned int x)
{
    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return((x >> 16) | (x << 16));
}

From the famous Bit Twiddling Hacks page:

Fastest (lookup table):

static const unsigned char BitReverseTable256[] =
{
  0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
  0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
  0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
  0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
  0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
  0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
  0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
  0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
  0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
  0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
  0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
  0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
  0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
  0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
  0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
  0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
};

unsigned int v; // reverse 32-bit value, 8 bits at a time
unsigned int c; // c will get v reversed

// Option 1:
c = (BitReverseTable256[v & 0xff] << 24) |
    (BitReverseTable256[(v >> 8) & 0xff] << 16) |
    (BitReverseTable256[(v >> 16) & 0xff] << 8) |
    (BitReverseTable256[(v >> 24) & 0xff]);

// Option 2:
unsigned char * p = (unsigned char *) &v;
unsigned char * q = (unsigned char *) &c;
q[3] = BitReverseTable256[p[0]];
q[2] = BitReverseTable256[p[1]];
q[1] = BitReverseTable256[p[2]];
q[0] = BitReverseTable256[p[3]];

You can extend this idea to 64-bit ints, or trade off memory for speed (assuming your L1 Data Cache is large enough), and reverse 16 bits at a time with a 64K-entry lookup table.


Others

Simple

unsigned int v;         // input bits to be reversed
unsigned int r = v & 1; // r will be reversed bits of v; first get LSB of v
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end (CHAR_BIT is in <limits.h>)

for (v >>= 1; v; v >>= 1)
{
    r <<= 1;
    r |= v & 1;
    s--;
}
r <<= s; // shift when v's highest bits are zero

Faster (32-bit processor)

unsigned char b = x; // the byte to reverse
b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;

Faster (64-bit processor)

unsigned char b; // reverse this (8-bit) byte
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;

If you want to do this on a 32-bit int, just reverse the bits in each byte, and reverse the order of the bytes. That is:

unsigned int toReverse;
unsigned int reversed;
unsigned char inByte0 = (toReverse & 0xFF);
unsigned char inByte1 = (toReverse & 0xFF00) >> 8;
unsigned char inByte2 = (toReverse & 0xFF0000) >> 16;
unsigned char inByte3 = (toReverse & 0xFF000000) >> 24;
reversed = (reverseBits(inByte0) << 24) | (reverseBits(inByte1) << 16) |
           (reverseBits(inByte2) << 8) | reverseBits(inByte3);

Results

I benchmarked the two most promising solutions: the lookup table and the bitwise-AND approach (the first one). The test machine is a laptop w/ 4GB of DDR2-800 and a Core 2 Duo T7500 @ 2.4GHz, 4MB L2 Cache; YMMV. I used gcc 4.3.2 on 64-bit Linux. OpenMP (and the GCC bindings) were used for high-resolution timers.

reverse.c

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

unsigned int reverse(register unsigned int x)
{
    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return((x >> 16) | (x << 16));
}

int main()
{
    unsigned int *ints = malloc(100000000*sizeof(unsigned int));
    unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
    for(unsigned int i = 0; i < 100000000; i++)
      ints[i] = rand();
    unsigned int *inptr = ints;
    unsigned int *outptr = ints2;
    unsigned int *endptr = ints + 100000000;

    // Starting the time measurement
    double start = omp_get_wtime();
    // Computations to be measured
    while(inptr != endptr)
    {
      (*outptr) = reverse(*inptr);
      inptr++;
      outptr++;
    }
    // Measuring the elapsed time
    double end = omp_get_wtime();
    // Time calculation (in seconds)
    printf("Time: %f seconds\n", end-start);

    free(ints);
    free(ints2);
    return 0;
}

reverse_lookup.c

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

static const unsigned char BitReverseTable256[] =
{
  0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
  0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
  0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
  0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
  0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
  0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
  0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
  0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
  0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
  0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
  0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
  0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
  0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
  0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
  0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
  0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
};

int main()
{
    unsigned int *ints = malloc(100000000*sizeof(unsigned int));
    unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
    for(unsigned int i = 0; i < 100000000; i++)
      ints[i] = rand();
    unsigned int *inptr = ints;
    unsigned int *outptr = ints2;
    unsigned int *endptr = ints + 100000000;

    // Starting the time measurement
    double start = omp_get_wtime();
    // Computations to be measured
    while(inptr != endptr)
    {
      unsigned int in = *inptr;
      // Option 1:
      //*outptr = (BitReverseTable256[in & 0xff] << 24) |
      //    (BitReverseTable256[(in >> 8) & 0xff] << 16) |
      //    (BitReverseTable256[(in >> 16) & 0xff] << 8) |
      //    (BitReverseTable256[(in >> 24) & 0xff]);
      // Option 2:
      unsigned char * p = (unsigned char *) &(*inptr);
      unsigned char * q = (unsigned char *) &(*outptr);
      q[3] = BitReverseTable256[p[0]];
      q[2] = BitReverseTable256[p[1]];
      q[1] = BitReverseTable256[p[2]];
      q[0] = BitReverseTable256[p[3]];
      inptr++;
      outptr++;
    }
    // Measuring the elapsed time
    double end = omp_get_wtime();
    // Time calculation (in seconds)
    printf("Time: %f seconds\n", end-start);

    free(ints);
    free(ints2);
    return 0;
}

I tried both approaches at several different optimizations, ran 3 trials at each level, and each trial reversed 100 million random unsigned ints. For the lookup table option, I tried both schemes (options 1 and 2) given on the bitwise hacks page. Results are shown below.

Bitwise AND

mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 2.000593 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 1.938893 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 1.936365 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 0.942709 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.991104 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.947203 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 0.922639 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.892372 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.891688 seconds

Lookup Table (option 1)

mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.201127 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.196129 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.235972 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.633042 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.655880 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.633390 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.652322 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.631739 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.652431 seconds

Lookup Table (option 2)

mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.671537 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.688173 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.664662 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.049851 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.048403 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.085086 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.082223 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.053431 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.081224 seconds

Conclusion

Use the lookup table, with option 1 (byte addressing is unsurprisingly slow) if you're concerned about performance. If you need to squeeze every last byte of memory out of your system (and you might, if you care about the performance of bit reversal), the optimized versions of the bitwise-AND approach aren't too shabby either.

Caveat

Yes, I know the benchmark code is a complete hack. Suggestions on how to improve it are more than welcome. Things I know about:

  • I don't have access to ICC. This may be faster (please respond in a comment if you can test this out).
  • A 64K lookup table may do well on some modern microarchitectures with large L1D.
  • -mtune=native didn't work for -O2/-O3 (ld blew up with some crazy symbol redefinition error), so I don't believe the generated code is tuned for my microarchitecture.
  • There may be a way to do this slightly faster with SSE. I have no idea how, but with fast replication, packed bitwise AND, and swizzling instructions, there's got to be something there.
  • I know only enough x86 assembly to be dangerous; here's the code GCC generated on -O3 for option 1, so somebody more knowledgable than myself can check it out:

32-bit

.L3:
        movl    (%r12,%rsi), %ecx
        movzbl  %cl, %eax
        movzbl  BitReverseTable256(%rax), %edx
        movl    %ecx, %eax
        shrl    $24, %eax
        mov     %eax, %eax
        movzbl  BitReverseTable256(%rax), %eax
        sall    $24, %edx
        orl     %eax, %edx
        movzbl  %ch, %eax
        shrl    $16, %ecx
        movzbl  BitReverseTable256(%rax), %eax
        movzbl  %cl, %ecx
        sall    $16, %eax
        orl     %eax, %edx
        movzbl  BitReverseTable256(%rcx), %eax
        sall    $8, %eax
        orl     %eax, %edx
        movl    %edx, (%r13,%rsi)
        addq    $4, %rsi
        cmpq    $400000000, %rsi
        jne     .L3

EDIT: I also tried using uint64_t types on my machine to see if there was any performance boost. Performance was about 10% faster than 32-bit, and was nearly identical whether you were just using 64-bit types to reverse bits on two 32-bit int types at a time, or whether you were actually reversing bits in half as many 64-bit values. The assembly code is shown below (for the former case, reversing bits for two 32-bit int types at a time):

.L3:
        movq    (%r12,%rsi), %rdx
        movq    %rdx, %rax
        shrq    $24, %rax
        andl    $255, %eax
        movzbl  BitReverseTable256(%rax), %ecx
        movzbq  %dl,%rax
        movzbl  BitReverseTable256(%rax), %eax
        salq    $24, %rax
        orq     %rax, %rcx
        movq    %rdx, %rax
        shrq    $56, %rax
        movzbl  BitReverseTable256(%rax), %eax
        salq    $32, %rax
        orq     %rax, %rcx
        movzbl  %dh, %eax
        shrq    $16, %rdx
        movzbl  BitReverseTable256(%rax), %eax
        salq    $16, %rax
        orq     %rax, %rcx
        movzbq  %dl,%rax
        shrq    $16, %rdx
        movzbl  BitReverseTable256(%rax), %eax
        salq    $8, %rax
        orq     %rax, %rcx
        movzbq  %dl,%rax
        shrq    $8, %rdx
        movzbl  BitReverseTable256(%rax), %eax
        salq    $56, %rax
        orq     %rax, %rcx
        movzbq  %dl,%rax
        shrq    $8, %rdx
        movzbl  BitReverseTable256(%rax), %eax
        andl    $255, %edx
        salq    $48, %rax
        orq     %rax, %rcx
        movzbl  BitReverseTable256(%rdx), %eax
        salq    $40, %rax
        orq     %rax, %rcx
        movq    %rcx, (%r13,%rsi)
        addq    $8, %rsi
        cmpq    $400000000, %rsi
        jne     .L3


This thread caught my attention since it deals with a simple problem that requires a lot of work (CPU cycles) even for a modern CPU. And one day I also stood there with the same ¤#%"#" problem: I had to flip millions of bytes. However, I knew all my target systems were modern Intel-based, so let's start optimizing to the extreme!!!

So I used Matt J's lookup code as the base. The system I'm benchmarking on is an i7 Haswell 4700EQ.

Matt J's lookup bitflipping 400 000 000 bytes: Around 0.272 seconds.

I then went ahead and tried to see if Intel's ISPC compiler could vectorize the arithmetic in reverse.c.

I'm not going to bore you with my findings here, since I tried a lot to help the compiler find stuff. Anyhow, I ended up with performance of around 0.15 seconds to bitflip 400 000 000 bytes. It's a great reduction, but for my application that's still way, way too slow.

So people let me present the fastest Intel based bitflipper in the world. Clocked at:

Time to bitflip 400000000 bytes: 0.050082 seconds !!!!!

// Bitflip using AVX2 - The fastest Intel based bitflip in the world!!
// Made by Anders Cedronius 2014 (anders.cedronius (you know what) gmail.com)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

using namespace std;

#define DISPLAY_HEIGHT  4
#define DISPLAY_WIDTH   32
#define NUM_DATA_BYTES  400000000

// Constants (first we got the mask, then the high order nibble look up table
// and last we got the low order nibble lookup table)
__attribute__ ((aligned(32))) static unsigned char k1[32*3]={
        0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,
        0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,
        0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0,0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0
};

// The data to be bitflipped (+32 to avoid the quantization out of memory problem)
__attribute__ ((aligned(32))) static unsigned char data[NUM_DATA_BYTES+32]={};

extern "C" {
void bitflipbyte(unsigned char[],unsigned int,unsigned char[]);
}

int main()
{
    for(unsigned int i = 0; i < NUM_DATA_BYTES; i++)
    {
        data[i] = rand();
    }
    printf ("\r\nData in(start):\r\n");
    for (unsigned int j = 0; j < 4; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }
    printf ("\r\nNumber of 32-byte chunks to convert: %d\r\n",(unsigned int)ceil(NUM_DATA_BYTES/32.0));
    double start_time = omp_get_wtime();
    bitflipbyte(data,(unsigned int)ceil(NUM_DATA_BYTES/32.0),k1);
    double end_time = omp_get_wtime();
    printf ("\r\nData out:\r\n");
    for (unsigned int j = 0; j < 4; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }
    printf("\r\n\r\nTime to bitflip %d bytes: %f seconds\r\n\r\n",NUM_DATA_BYTES, end_time-start_time);
    // return with no errors
    return 0;
}

The printf's are for debugging.

Here is the workhorse:

bits 64
global bitflipbyte

bitflipbyte:
        vmovdqa     ymm2, [rdx]
        add         rdx, 20h
        vmovdqa     ymm3, [rdx]
        add         rdx, 20h
        vmovdqa     ymm4, [rdx]
bitflipp_loop:
        vmovdqa     ymm0, [rdi]
        vpand       ymm1, ymm2, ymm0
        vpandn      ymm0, ymm2, ymm0
        vpsrld      ymm0, ymm0, 4h
        vpshufb     ymm1, ymm4, ymm1
        vpshufb     ymm0, ymm3, ymm0
        vpor        ymm0, ymm0, ymm1
        vmovdqa     [rdi], ymm0
        add         rdi, 20h
        dec         rsi
        jnz         bitflipp_loop
        ret

The code takes 32 bytes then masks out the nibbles. The high nibble gets shifted right by 4. Then I use vpshufb and ymm4 / ymm3 as lookup tables. I could use a single lookup table but then I would have to shift left before ORing the nibbles together again.

There are even faster ways of flipping the bits, but I'm bound to a single thread and CPU, so this was the fastest I could achieve. Can you make a faster version?

Please make no comments about using the Intel C/C++ Compiler Intrinsic Equivalent commands...


Well this certainly won't be an answer like Matt J's but hopefully it will still be useful.

size_t reverse(size_t n, unsigned int bytes)
{
    __asm__("BSWAP %0" : "=r"(n) : "0"(n));
    n >>= ((sizeof(size_t) - bytes) * 8);
    n = ((n & 0xaaaaaaaaaaaaaaaa) >> 1) | ((n & 0x5555555555555555) << 1);
    n = ((n & 0xcccccccccccccccc) >> 2) | ((n & 0x3333333333333333) << 2);
    n = ((n & 0xf0f0f0f0f0f0f0f0) >> 4) | ((n & 0x0f0f0f0f0f0f0f0f) << 4);
    return n;
}

This is exactly the same idea as Matt's best algorithm, except that there's this little instruction called BSWAP which swaps the bytes (not the bits) of a 64-bit number. So b7,b6,b5,b4,b3,b2,b1,b0 becomes b0,b1,b2,b3,b4,b5,b6,b7. Since we are working with a 32-bit number, we need to shift our byte-swapped number down 32 bits. This just leaves us the task of reversing the 8 bits of each byte, which the three mask-and-shift steps do, and voilà, we're done.

Timing: on my machine, Matt's algorithm ran in ~0.52 seconds per trial. Mine ran in about 0.42 seconds per trial. 20% faster is not bad, I think.

If you're worried about the availability of BSWAP, Wikipedia lists the instruction as having been added with the 80486, which came out in 1989. It should be noted that Wikipedia also states that this instruction only works on 32-bit registers, which is clearly not the case on my machine; as used here, it very much works on 64-bit registers.

This method will work equally well for any integral datatype so the method can be generalized trivially by passing the number of bytes desired:

size_t reverse(size_t n, unsigned int bytes)
{
    __asm__("BSWAP %0" : "=r"(n) : "0"(n));
    n >>= ((sizeof(size_t) - bytes) * 8);
    n = ((n & 0xaaaaaaaaaaaaaaaa) >> 1) | ((n & 0x5555555555555555) << 1);
    n = ((n & 0xcccccccccccccccc) >> 2) | ((n & 0x3333333333333333) << 2);
    n = ((n & 0xf0f0f0f0f0f0f0f0) >> 4) | ((n & 0x0f0f0f0f0f0f0f0f) << 4);
    return n;
}

which can then be called like:

n = reverse(n, sizeof(char));   // only reverse 8 bits
n = reverse(n, sizeof(short));  // reverse 16 bits
n = reverse(n, sizeof(int));    // reverse 32 bits
n = reverse(n, sizeof(size_t)); // reverse 64 bits

The compiler should be able to optimize the extra parameter away (assuming it inlines the function), and for the sizeof(size_t) case the right shift would be removed completely. Note that GCC, at least, is not able to remove the BSWAP and right shift when passed sizeof(char).