C - have a simple loop that does arithmetic calculation; profiler shows this is a bottleneck. How to speed it up? C - have a simple loop that does arithmetic calculation; profiler shows this is a bottleneck. How to speed it up? c c

C - have a simple loop that does arithmetic calculation; profiler shows this is a bottleneck. How to speed it up?


This appears to be a cache issue, not an arithmetic one.

for (; j >= 0; --j) {    ...    ... coef[j];}

You're accessing an array here, and you are decrementing an index to do so. This action can really disrupt the cache-friendly locality inherent in a simple loop.

Is it possible to count forward? Ie,

for (int i = 0; i <= j; i++) {    ...    ... coef[i];}

Would your calculation be valid?


First thing I would try would be to iterate in steps of 4, ie: j+=4 (or in your case, j -=4) and semi-unroll the loop. The reason for this is it will help the compiler to make SSE optimisations and to batch memory access from main memory to cache. Just be aware that you will have to cater for the last few elements in case the loop count is not divisible by 4. For example:

// Disclaimer: I have not tested this code!for (; j >= 3; j -= 4) {                  brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef[j];     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef[j-1];     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef[j-2];     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef[j-3]; }                          // if (j % 4) != 0 before the loop operation, // handle 1, 2 or 3 remaining elements here

Second thing I would try would be to preload coeff[j] into a register immediate prior to the calculation. The reason for this is floating point calculations are pipelined, meaning that a memory access in the wrong place can have adverse effects on performance. The calculation itself can be very fast but might take 14 instructions just to queue up the data from cache into the FPU. Add to that an access from main memory it can get even worse. For instance, try this (could also be tried with and without the -=4 unrolling)

// Disclaimer: I have not tested this code!register double coef1, coef2, coef3, ceof4;for (; j >= 3; j -= 4) {               coef1 = coef[j];    // Preloads the 4 sequential coeffs from     coef2 = coef[j-1];  // main memory to cache (if available)    coef3 = coef[j-2];      coef4 = coef[j-3];      brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef1;     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef2;     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef3;     brp2 = brpp;                          brpp = br;                            br = x2 * brpp - brp2 + coef4; } 

In this case the variables double x2, br, brp2, brpp, coef1, coef2, coef3, coef4 should be registers if at all possible.

Finally, using the above, can you apply SSE/SSE2 optimisation to it? Make sure this is enabled in the GCC compiler (I'm used to VS so the equivalent would be Release mode on, debug symbols off, optimization on, SSE2 on) and benchmark your code without the debugger attached. This alone can have a dramatic affect on performance.

Let us know the results. Performance tuning is trial and error!

Best of luck,


I've never tried using registers or other fancy tricks to speed things up before. Anyone think something like that can work here?

There's a very easy register trick that anyone can do. Build the project for a recent CPU. Does this code need to run on a computer from 1995? 2000? 2005? If the program can count on a newer CPU, it can count on having more registers at its disposal.

Also, The integer indexing is unnecessary. You could instead make j a pointer directly to the double of interest. This may make a difference if your optimizing compiler isn't already doing it.

double swi_echeb(const double x, const double* const coef, const int ncf){    const double *j = &coef[ncf - 1];    // (stuff...)    while (true) {         // (stuff...)        br = x2 * brpp - brp2 + *j;        if ( j == coef )            break;        --j;    }  }