
Take advantage of ARM unaligned memory access while writing clean C code


OK, the situation is more confusing than one would like. So, in an effort to clarify, here are the findings from this journey:

accessing unaligned memory

  1. The only portable C standard solution to access unaligned memory is the memcpy one. I was hoping to get another one through this question, but apparently it's the only one found so far.

Example code:

```c
u32 read32(const void* ptr)
{
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
```

This solution is safe in all circumstances. It also compiles into a trivial load-register operation on x86 targets using GCC.

However, on ARM targets using GCC, it translates into an oversized, useless assembly sequence, which bogs down performance.

Using Clang on ARM targets, memcpy works fine (see @notlikethat comment below). It would be easy to blame GCC at large, but it's not that simple: the memcpy solution works fine on GCC with x86/x64, PPC and ARM64 targets. Lastly, trying another compiler, icc 13, the memcpy version is surprisingly heavier on x86/x64 (4 instructions, while one should be enough). And that's just the combinations I could test so far.

I have to thank the Godbolt project for making such statements easy to observe.

  2. The second solution is to use __packed structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for the maintenance of portable code.

That being said, in most circumstances, it leads to better code generation than memcpy. In most circumstances only ...

For example, regarding the above cases where the memcpy solution does not work, here are the findings:

  • on x86 with ICC: __packed solution works
  • on ARMv7 with GCC: __packed solution works
  • on ARMv6 with GCC: does not work. The assembly looks even uglier than memcpy.
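As a sketch of this approach, here is the GCC/Clang spelling, using `__attribute__((packed))` and `uint32_t` in place of the `u32` typedef; other compilers spell it differently (`#pragma pack`, a `__packed` keyword), which is exactly the portability mess mentioned above:

```c
#include <stdint.h>

/* GCC/Clang spelling of a packed wrapper struct. A pointer to it may be
 * unaligned, so the compiler must emit an alignment-safe load for ->v. */
typedef struct {
    uint32_t v;
} __attribute__((packed)) unalign32;

uint32_t read32_packed(const void *ptr)
{
    return ((const unalign32 *)ptr)->v;
}
```
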

  3. The last solution is to use direct u32 access to unaligned memory positions. This solution used to work for decades on x86 CPUs, but is not recommended, as it violates the C standard: the compiler is allowed to treat such a statement as a guarantee that the data is properly aligned, leading to buggy code generation.

Unfortunately, in at least one case, it is the only solution able to extract performance from the target. Namely, for GCC on ARMv6.

Do not use this solution on ARMv7 though: GCC can generate instructions which are reserved for aligned memory accesses, namely LDM (Load Multiple), leading to a crash.

Even on x86/x64, it becomes dangerous to write your code this way nowadays, as the new generation compilers may try to auto-vectorize some compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, crashing the program.
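For completeness, the direct-access version looks like this (again using `uint32_t` for `u32`); the cast is where the danger lives, since it tells the compiler the pointer is 4-byte aligned:

```c
#include <stdint.h>

/* Fastest option for GCC on ARMv6, but undefined behavior in standard C
 * when ptr is unaligned: the cast licenses the compiler to assume
 * 4-byte alignment and emit aligned-only instructions. */
uint32_t read32_direct(const void *ptr)
{
    return *(const uint32_t *)ptr;
}
```
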

As a recap, here are the results summarized as a table, listing the preferred solution per target using the convention: memcpy > packed > direct.

| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  | PPC    |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy | ?      |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |


Part of the issue is likely that you are not allowing for easy inlining and further optimization. Keeping the load in a separate function means a function call may be emitted at each call site, which could reduce performance.

One thing you might do is use static inline, which will allow the compiler to inline the function load32(), thus increasing performance. However, at higher levels of optimization, the compiler should already be inlining this for you.
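For example (same memcpy body as above, now declared so it can live in a header and be expanded at each call site; `load32` and `uint32_t` stand in for your names):

```c
#include <stdint.h>
#include <string.h>

/* static inline lets the compiler expand this at every call site;
 * the 4-byte memcpy then typically folds into a single load. */
static inline uint32_t load32(const void *ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
```
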

If the compiler inlines a 4-byte memcpy, it will likely transform it into the most efficient series of loads or stores that still works on unaligned boundaries. Therefore, if you are still seeing low performance even with compiler optimizations enabled, that may simply be the maximum performance for unaligned reads and writes on the processors you are using. Since you said "__packed instructions" yield identical performance to memcpy(), this would seem to be the case.


At this point, there is very little that you can do except to align your data. However, if you are dealing with a contiguous array of unaligned u32's, there is one thing you could do:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// get an aligned copy of n unaligned u32s
uint32_t *align32(const void *p, size_t n)
{
    uint32_t *r = malloc(n * sizeof(uint32_t));
    if (r)
        memcpy(r, p, n * sizeof(uint32_t));
    return r;
}
```

This simply allocates a new array using malloc(), because malloc() and friends allocate memory with correct alignment for everything:

The malloc() and calloc() functions return a pointer to the allocated memory that is suitably aligned for any kind of variable.

- malloc(3), Linux Programmer's Manual

This should be relatively fast, as you only have to do it once per set of data. Also, while copying, memcpy() can adjust only for the initial lack of alignment and then use the fastest aligned load and store instructions available, after which you can deal with your data using normal aligned reads and writes at full performance.
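A minimal usage sketch of this idea (the `sum_unaligned` helper is hypothetical, and `align32` is repeated so the example is self-contained); remember to free() the copy when you are done with it:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// aligned copy of n unaligned u32s, as above
static uint32_t *align32(const void *p, size_t n)
{
    uint32_t *r = malloc(n * sizeof(uint32_t));
    if (r)
        memcpy(r, p, n * sizeof(uint32_t));
    return r;
}

// hypothetical helper: copy once, then read at full aligned speed
static uint32_t sum_unaligned(const unsigned char *raw, size_t n)
{
    uint32_t *a = align32(raw, n);
    uint32_t sum = 0;
    if (a) {
        for (size_t i = 0; i < n; i++)
            sum += a[i];   /* plain aligned reads from here on */
        free(a);           /* release the aligned copy */
    }
    return sum;
}
```
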