
Fast method to copy memory with translation - ARGB to BGR


I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.


Editor's note: the original inline asm used unsafe constraints: it modified input-only operands, and it didn't tell the compiler about the side effect on memory pointed to by pointer inputs in registers. Apparently this worked OK for the benchmark. I fixed the constraints to be properly safe for all callers. This shouldn't affect the benchmark numbers, only make the surrounding code safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).

In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.

Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.
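One way to handle that tail separately (my sketch, not from the original benchmarks; `swap_last_pixel` is a hypothetical helper name, and `__builtin_bswap32` is the GCC/clang equivalent of OSSwapInt32):

```c
#include <stdint.h>
#include <string.h>

/* Convert one final ARGB pixel without storing past the end of dest:
 * do the byte reversal in a temporary and memcpy only 3 bytes out. */
static void swap_last_pixel(const uint8_t *orig, uint8_t *dest)
{
    uint32_t argb;
    uint8_t bgra[4];
    memcpy(&argb, orig, 4);           /* safe unaligned, aliasing-clean load */
    argb = __builtin_bswap32(argb);   /* same byte reversal as OSSwapInt32 */
    memcpy(bgra, &argb, 4);
    memcpy(dest, bgra, 3);            /* store only the 3 BGR bytes */
}
```

You'd run one of the bulk loops over all but the final pixel (or, for the SIMD versions, all but the final chunk), then call this for what's left.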

--- @PeterCordes


The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).

void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB.  Use memcpy for unaligned loads/stores
    }
}

The second method performs the same operation, but uses an inline assembly loop instead of a C loop.

void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
        "0:\n\t"                   // do {
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"    // copy a dword byte-reversed
        "add    $4,%1\n\t"         // orig += 4 bytes
        "add    $3,%0\n\t"         // dest += 3 bytes
        "dec    %2\n\t"
        "jnz    0b"                // }while(--imageSize)
        : "+r" (dest), "+r" (orig), "+r" (imageSize)
        : // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
        : "flags", "eax", "memory"
    );
}

The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; it's fine to use movdqu but there's no downside.)

typedef char v16qi __attribute__ ((vector_size (16)));

void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0xFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest, __builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig), mask));
    }
}

Finally, the fourth version is the inline assembly equivalent of the third.

void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0xFF};
    asm volatile (
        "lddqu  %3,%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "m" (mask)  // whole array as a memory operand.  "x" would get the compiler to load it
        : "flags", "xmm0", "xmm1", "memory"
    );
}

(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)

On my 2010 MacBook Pro, 2.4 GHz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds

As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so supporting the earliest Intel Macs, which didn't support SSSE3, won't cost you much.

Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through each of the functions. The runtime of each function in microseconds is listed below.

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1:   10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2:   10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3:    9036  9619  9341  8970  9453  9758  9043 10114  9243  9027  9163  9176  9168  9122  9514  9049  9161  9086  9064  9604  9178  9233  9301  9717  9156
Version 4:    9339 10119  9846  9217  9526  9182  9145 10286  9051  9614  9249  9653  9799  9270  9173  9103  9132  9550  9147  9157  9199  9113  9699  9354  9314


The obvious, using pshufb.

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}


Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

EDIT2: More detail about the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since a 3-byte pixel only comes back to 16-byte alignment every 16 pixels (48 bytes), we batch up 16 input pixels at a time using a combination of shuffles, masks, and ors.

From LSB to MSB, the inputs look like this, ignoring the specific components:

s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333

and the outputs look like this:

d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333

So to generate those outputs, you need to do the following (I will specify the actual transformations later):

d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))

Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:

combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

where (1 means select the left pixel, 0 means select the right pixel):

mask(0)= 111 111 111 111 000 0
mask(1)=  11 111 111 000 000 00
mask(2)=   1 111 000 000 000 000
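As a scalar illustration of that identity (my sketch; the SSE code does the same thing per 128-bit register with `_mm_and_si128`, `_mm_andnot_si128`, and `_mm_or_si128`):

```c
#include <stdint.h>

/* Bitwise select: where mask bits are 1, take bits from left,
 * otherwise take bits from right. */
static uint32_t combine(uint32_t left, uint32_t right, uint32_t mask)
{
    return (left & mask) | (right & ~mask);
}
```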

But the actual transformations (f_<x>_low, f_<x>_high) are not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination, for brevity):

d[0]=
    s[0][0].Blue s[0][0].Green s[0][0].Red
    s[0][1].Blue s[0][1].Green s[0][1].Red
    s[0][2].Blue s[0][2].Green s[0][2].Red
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue

If you translate the above into byte offsets from source to dest, you get:

d[0]=
    &s[0]+3  &s[0]+2  &s[0]+1
    &s[0]+7  &s[0]+6  &s[0]+5
    &s[0]+11 &s[0]+10 &s[0]+9
    &s[0]+15 &s[0]+14 &s[0]+13
    &s[1]+3  &s[1]+2  &s[1]+1
    &s[1]+7

(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)

Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):

f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7
f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10
f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13

We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:

f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X

Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_<x>, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.

EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig and dest are 16-byte aligned (dest because of the streaming stores)
// imagesize is a multiple of 16 pixels; each iteration consumes 64 input
// bytes and stores exactly 48 output bytes, so no trailing scratch is needed
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert((uintptr_t)dest % 16 == 0);
    assert(imagesize % 16 == 0);
    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first pixel
        -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel
    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
        -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to third pixel
    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
        -128, -128, -128, -128); // unused
    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
        // combine_x(left, right) = (left & mask(x)) | (right & ~mask(x));
        // note that _mm_andnot_si128(a, b) computes (~a) & b, so the mask goes first
        _mm_stream_si128((__m128i *)dest,
                         _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
        _mm_stream_si128((__m128i *)dest + 1,
                         _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
        _mm_stream_si128((__m128i *)dest + 2,
                         _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
    }
}