SSE instructions to add all elements of an array [duplicate]


If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size, and then sum the unpacked elements. Note that you can maintain multiple partial sums until after the loop and then just do one final sum of these partial sums. For example:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <stdint.h>

    uint32_t sum_array(const uint8_t a[], int n)
    {
        const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
        const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
        __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
        uint32_t sum;
        int i;

        for (i = 0; i < n; i += 16)
        {
            __m128i v = _mm_load_si128((const __m128i *)&a[i]); // load vector of 8 bit values (aligned load)
            __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
            __m128i vh = _mm_unpackhi_epi8(v, vk0);
            // widen and accumulate 16 bit values into the 32 bit partial sum vector
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
        }

        // horizontal add of four 32 bit partial sums and return result
        vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
        vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
        sum = (uint32_t)_mm_cvtsi128_si32(vsum);
        return sum;
    }

Note that there is one non-obvious trick in the above code: rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplicand of 1. PMADDWD multiplies each 16 bit element by 1 and adds adjacent pairs of products into 32 bit lanes, so combined with _mm_add_epi32 it effectively gives us free unpacking, and we get the same result using 4 instructions instead of 8.
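To see the PMADDWD trick in isolation, here is a minimal sketch. The helper name pair_sums_epi16 is hypothetical and only for illustration. It takes eight 16 bit values and produces four 32 bit sums of adjacent pairs, which is the widening step the loop above relies on.

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Hypothetical helper (not part of the answer's code):
// out[k] = in[2k] + in[2k+1], widened from 16 bit to 32 bit.
void pair_sums_epi16(const int16_t in[8], int32_t out[4])
{
    const __m128i vk1 = _mm_set1_epi16(1);             // multiplicand of 1 in every 16 bit lane
    __m128i v = _mm_loadu_si128((const __m128i *)in);  // unaligned load of eight 16 bit values
    __m128i vs = _mm_madd_epi16(v, vk1);               // in[2k]*1 + in[2k+1]*1 per 32 bit lane
    _mm_storeu_si128((__m128i *)out, vs);              // store four 32 bit pair sums
}
```

Because the inputs come from zero-extended 8 bit values (0 to 255), each pair sum is at most 510, so nothing is lost by pairing elements before widening.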

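Note also that the input array, a[], needs to be 16 byte aligned, because _mm_load_si128 faults on an unaligned address, and n should be a multiple of 16, because the loop always processes a full 16 byte vector.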
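If you cannot guarantee those two conditions, one possible variation (a sketch, not part of the original code) is to use the unaligned load _mm_loadu_si128 and finish any leftover elements with a scalar loop. The name sum_array_any is hypothetical, and the unaligned load may be slower than the aligned one on older CPUs.

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Hypothetical variant: works for any alignment and any n >= 0.
uint32_t sum_array_any(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_setzero_si128();    // zeros for unpacking 8 bit -> 16 bit
    const __m128i vk1 = _mm_set1_epi16(1);      // ones for _mm_madd_epi16
    __m128i vsum = _mm_setzero_si128();         // four partial 32 bit sums
    uint32_t sum;
    int i;

    // vector loop: only runs while a full 16 byte block remains
    for (i = 0; i <= n - 16; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&a[i]); // unaligned load
        __m128i vl = _mm_unpacklo_epi8(v, vk0);
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }

    // horizontal add of the four 32 bit partial sums
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    // scalar tail: the remaining 0 to 15 elements
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Calling sum_array_any(buf + 1, 37) on a byte buffer, for example, exercises both the unaligned load and the scalar tail.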