
Will multi threading provide any performance boost?


Multithreading across multiple cores could reduce the time required to sum across the axes, but special care is required. You might actually get larger performance boosts from some changes you could make to your single-threaded code:

  1. You only need as many threads as there are cores available to you. This is a CPU-intensive operation, and threads are unlikely to be waiting for I/O.

  2. The above assumption might not hold if the entire array does not fit in RAM. If portions of the array are paged in and out, some threads will be waiting for paging operations to complete. In that case, the program might benefit from having more threads than cores. Too many, however, and performance will drop due to the cost of context switching. You might have to experiment with the thread count. The general rule is to minimize the number of context switches between ready threads.

  3. If the entire array does not fit in RAM, you want to minimize paging! The order in which each thread accesses memory matters, as does the memory access pattern of all the running threads. To the extent possible, you would want to finish up with one part of the array before moving to the next, never to return to a covered area.

  4. Each core would benefit from accessing a completely separate region of memory. You want to avoid memory access delays caused by locks and bus contention. At least for one dimension of the cube, that should be straightforward: give each thread its own portion of the cube (see the sketch after this list).

  5. Each core would also benefit from accessing more data from its cache(s), as opposed to fetching from RAM. That would mean ordering the loops such that inner loops access nearby words, rather than skipping across rows.

  6. Finally, depending on the data types in the array, the SIMD instructions of Intel/AMD processors (SSE, at their various generations) can help accelerate single-core performance by summing multiple cells at once. VC++ has some built-in support.

  7. If you have to prioritize your work, you might want to first minimize disk paging, then concentrate on optimizing memory access to make use of the CPU caches, and only then deal with multithreading.
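
Here's the sketch promised in point 4. It's a minimal, hypothetical example (not your code, and it assumes a C++11 compiler for std::thread): the outer loop of one projection pass is split across one thread per core, so each thread reads its own slab of the cube and writes its own contiguous slice of the output. The function and variable names are made up, and the output is an int array.

    #include <thread>
    #include <vector>

    typedef unsigned char uchar;

    // Sum one slab of j values: projection[j*dim + i] counts how many cells
    // along k in column (i, j) are >= stage. Same inner maths as the question.
    void sum_slice(const uchar *partMap, int *projection, int dim, int stage,
                   int j_begin, int j_end)
    {
        for (int j = j_begin; j < j_end; j++)
            for (int i = 0; i < dim; i++)
            {
                int sum = 0;
                for (int k = 0; k < dim; k++)
                    if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                        sum++;
                projection[(j * dim) + i] = sum;
            }
    }

    void sum_parallel(const uchar *partMap, int *projection, int dim, int stage)
    {
        // One thread per core is enough for a CPU-bound loop like this.
        int num_threads = (int)std::thread::hardware_concurrency();
        if (num_threads == 0) num_threads = 4;

        std::vector<std::thread> workers;
        int chunk = (dim + num_threads - 1) / num_threads;
        for (int t = 0; t < num_threads; t++) {
            int j_begin = t * chunk;
            int j_end = (j_begin + chunk < dim) ? j_begin + chunk : dim;
            if (j_begin >= j_end) break;
            workers.push_back(std::thread(sum_slice, partMap, projection,
                                          dim, stage, j_begin, j_end));
        }
        for (auto &w : workers)
            w.join();
    }

Because the j ranges are disjoint, each thread writes a disjoint, contiguous block of projection, so nothing needs locking; the threads only share read-only access to partMap.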


There is only one way to optimize code: figure out what you're doing that's slow, and do less of it. A special case of "doing less of it" is to do something else instead that's faster.

So first of all, here's what I'm doing based on your posted code:

    #include <fstream>
    #include <sstream>

    using std::ios_base;

    template<typename Iterator, typename Value>
    void iota(Iterator start, Iterator end, Value val) {
        while (start != end) {
            *(start++) = val++;
        }
    }

    int main() {
        const int dim = 1000;
        const int cubesize = dim*dim*dim;
        const int squaresize = dim*dim;
        const int steps = 7; // ranges from 1 to 255
        typedef unsigned char uchar;

        uchar *partMap = new uchar[cubesize];
        // dummy data. I timed this separately and it takes about
        // a second, so I won't worry about its effect on overall timings.
        iota(partMap, partMap + cubesize, uchar(7));

        uchar *projection = new uchar[squaresize];

        for (int stage = 1; stage < steps; stage++) {
            for (int j = 0; j < dim; j++) {
                for (int i = 0; i < dim; i++)
                {
                    int sum = 0;
                    for (int k = 0; k < dim; k++)
                        if (partMap[(((i * dim) + k) * dim) + j] >= stage)
                            sum++;
                    projection[(j*dim) + i] = sum;
                }
            }

            std::stringstream filename;
            filename << "results" << stage << ".bin";
            std::ofstream file(filename.str().c_str(),
                ios_base::out | ios_base::binary | ios_base::trunc);
            file.write((char *)projection, squaresize);
        }

        delete[] projection;
        delete[] partMap;
    }

(Edit: just noticed that "projection" should be an array of int, not uchar. My bad. This will make a difference to some of the timings, but hopefully not too big of one.)

Then I copied results*.bin to gold*.bin, so I can check my future changes as follows:

    $ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
    g++  -O3 -pedantic -Wall   big.cpp   -o big

    real    1m41.978s
    user    1m39.450s
    sys     0m0.451s

OK, so 100 seconds at the moment.

So, speculating that it's striding through the billion-item data array that's slow, let's try only going through once, instead of once per stage:

    uchar *projections[steps];
    for (int stage = 1; stage < steps; stage++) {
        projections[stage] = new uchar[squaresize];
    }

    for (int j = 0; j < dim; j++) {
        for (int i = 0; i < dim; i++)
        {
            int counts[256] = {0};
            for (int k = 0; k < dim; k++)
                counts[partMap[(((i * dim) + k) * dim) + j]]++;

            int sum = 0;
            for (int idx = 255; idx >= steps; --idx) {
                sum += counts[idx];
            }
            for (int stage = steps-1; stage > 0; --stage) {
                sum += counts[stage];
                projections[stage][(j*dim) + i] = sum;
            }
        }
    }

    for (int stage = 1; stage < steps; stage++) {
        std::stringstream filename;
        filename << "results" << stage << ".bin";
        std::ofstream file(filename.str().c_str(),
            ios_base::out | ios_base::binary | ios_base::trunc);
        file.write((char *)projections[stage], squaresize);
    }

    for (int stage = 1; stage < steps; stage++) delete[] projections[stage];
    delete[] partMap;

It's a bit faster:

    $ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
    g++  -O3 -pedantic -Wall   big.cpp   -o big

    real    1m15.176s
    user    1m13.772s
    sys     0m0.841s

Now, steps is quite small in this example, so we're doing a lot of unnecessary work with the "counts" array. Without even profiling, I'm guessing that counting to 256 twice (once to clear the array and once to sum it) is quite significant compared with counting to 1000 (to run along our column). So let's change that:

    for (int j = 0; j < dim; j++) {
        for (int i = 0; i < dim; i++)
        {
            // steps+1, not steps. I got this wrong the first time,
            // which at least proved that my diffs work as a check
            // of the answer...
            int counts[steps+1] = {0};
            for (int k = 0; k < dim; k++) {
                uchar val = partMap[(((i * dim) + k) * dim) + j];
                if (val >= steps)
                    counts[steps]++;
                else
                    counts[val]++;
            }

            int sum = counts[steps];
            for (int stage = steps-1; stage > 0; --stage) {
                sum += counts[stage];
                projections[stage][(j*dim) + i] = sum;
            }
        }
    }

Now we're only using as many buckets as we actually need.

    $ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
    g++  -O3 -pedantic -Wall   big.cpp   -o big

    real    0m27.643s
    user    0m26.551s
    sys     0m0.483s

Hurrah. The code is nearly 4 times as fast as the first version, and produces the same results. All I've done is change the order in which the maths is done: we haven't even looked at multi-threading or prefetching yet. And I haven't attempted any highly technical loop optimisation, just left it to the compiler. So this can be considered a decent start.

However, it's still taking an order of magnitude longer than the 1s that iota runs in. So there are probably big gains still to find. One main difference is that iota runs over the 1d array in sequential order, instead of leaping about all over the place. As I said in my first answer, you should aim to always use sequential order on the cube.

So, let's make a one-line change, switching the i and j loops:

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++) {

This still isn't sequential order, but it does mean we're focussing on one million-byte slice of our cube at a time. A modern CPU has at least 4MB cache, so with a bit of luck we'll only hit main memory for any given part of the cube once in the entire program. With even better locality we could reduce the traffic in and out of L1 cache, too, but main memory is the slowest.

How much difference does it make?

    $ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
    g++  -O3 -pedantic -Wall   big.cpp   -o big

    real    0m8.221s
    user    0m4.507s
    sys     0m0.514s

Not bad. In fact, this change alone brings the original code from 100s to 20s. So this is responsible for a factor of 5, and everything else I did is responsible for another factor of 5. (I think the difference between 'user' and 'real' time in the above is mostly accounted for by the fact that my virus scanner is running, which it wasn't earlier. 'user' is how much time the program occupied a CPU, 'real' includes time spent suspended, either waiting on I/O or giving another process time to run.)

Of course, my bucket sort relies on the fact that whatever we're doing with the values in each column is commutative and associative. Reducing the number of buckets only worked because large values are all treated the same. This might not be true for all your operations, so you'll have to look at the inner loop of each one in turn to figure out what to do with it.
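
To make that concrete, here's a hypothetical variant (not from your code): the same one-pass bucket idea applied to a different commutative, associative reduction, namely the sum of the values >= each stage rather than the count. The function and parameter names are invented, and for simplicity it takes one column handed to it as a contiguous run.

    typedef unsigned char uchar;

    // col: the `dim` values of one column (contiguous here for simplicity;
    //      the real cube would use the strided index).
    // out: out[stage] receives the sum of the values >= stage, for 1 <= stage < steps.
    void column_sums(const uchar *col, int dim, int steps, long long *out)
    {
        // one bucket per small value, plus one shared bucket for values >= steps
        long long totals[256] = {0};
        for (int k = 0; k < dim; k++) {
            uchar val = col[k];
            totals[val < steps ? val : steps] += val;
        }
        long long running = totals[steps];
        for (int stage = steps - 1; stage > 0; --stage) {
            running += totals[stage];
            out[stage] = running;
        }
    }

An order-dependent operation, for example finding the last value along the column that exceeds the threshold, couldn't be reorganised into buckets like this, which is why each inner loop needs looking at individually.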

And the code is a bit more complicated. Instead of running over the data doing "blah" for each stage, we're computing all the stages at the same time in a single run over the data. If you start doing row and column computations in a single pass, as I recommended in my first answer, this will get worse. You may have to start breaking your code into functions to keep it readable.

Finally, a lot of my performance gain came from an optimisation for the fact that "steps" is small. With steps=100, I get:

    $ make big -B CPPFLAGS="-O3 -pedantic -Wall" && time ./big; for n in 1 2 3 4 5 6; do diff -q results$n.bin gold$n.bin; done
    g++  -O3 -pedantic -Wall   big.cpp   -o big

    real    0m22.262s
    user    0m10.108s
    sys     0m1.029s

This isn't so bad. With steps=100 the original code probably takes about 1400 seconds, although I'm not going to run it to prove that. But it's worth remembering that I haven't completely taken away the time dependency on "steps", just made it sub-linear.


How does your code work? Does it go like this?

    for each row: add up the values
    for each column: add up the values
    for each stack: add up the values

If so, you might want to read up on "locality of reference". Depending how your data is stored, you might find that while you're doing the stacks, a whole cache line has to be pulled in for each value, because the values are nowhere near each other in memory. In fact, with a billion values, you could be pulling things all the way from disk. Sequential access with a long stride (distance between values) is the worst possible use for cache. Try profiling, and if you see that adding up the stacks is taking longer than adding up the rows, this is almost certainly why.
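
For illustration only (this assumes a layout of data[x + dim*(y + dim*z)], i.e. x contiguous in memory, which may or may not match how your array is actually stored), the two access patterns look like this:

    typedef unsigned char uchar;

    // Summing along x touches adjacent bytes: one 64-byte cache line serves
    // 64 values.
    long long sum_row(const uchar *data, long long dim, long long y, long long z)
    {
        long long total = 0;
        for (long long x = 0; x < dim; x++)
            total += data[x + dim*(y + dim*z)];   // stride: 1 byte
        return total;
    }

    // Summing along z jumps dim*dim bytes between values: every value costs a
    // fresh cache line, and possibly a fresh page, from memory.
    long long sum_stack(const uchar *data, long long dim, long long x, long long y)
    {
        long long total = 0;
        for (long long z = 0; z < dim; z++)
            total += data[x + dim*(y + dim*z)];   // stride: dim*dim bytes
        return total;
    }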

I think you could be saturating the memory bus(*), in which case multithreading would only help if core2 quad uses different buses for different cores. But if you're not saturating the bus bandwidth, you can't get best performance this way even once you multi-thread. You'll have 4 cores spending all their time stalled on cache misses instead of one.

If you are memory cache bound, then your goal should be to visit each page/line of memory as few times as possible. So I'd try things like running over the data once, adding each value to three different totals as you go. If that runs faster on a single core, then we're in business. The next step is that with a 1000x1000x1000 cube, you have 3 million totals on the go. That doesn't fit in cache either, so you have to worry about the same cache miss problems writing as you do reading.

You want to make sure that as you run along a row of 1000 adjacent values in RAM adding to the row total that they all share, you're also adding to adjacent totals for the columns and stacks (which they don't share). So the "square" of column totals should be stored in the appropriate way, as should the "square" of stacks. That way you deal with 1000 of your billion values just by pulling about 12k of memory into cache (4k for 1000 values, plus 4k for 1000 column totals, plus 4k for 1000 stack totals). As against that, you're doing more stores than you would be by concentrating on 1 total at a time (which therefore could be in a register).
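
Here's a rough sketch of that single-pass idea. The names, the long long totals, and the assumed layout data[x + dim*(y + dim*z)] (x contiguous) are all mine, not from your code: the inner x loop reads 1000 adjacent values, keeps the shared row total in a local variable, and updates 1000 adjacent column totals and 1000 adjacent stack totals.

    #include <cstddef>
    #include <vector>

    typedef unsigned char uchar;

    // row_tot   indexed [y + dim*z]  (sum along x)
    // col_tot   indexed [x + dim*z]  (sum along y)
    // stack_tot indexed [x + dim*y]  (sum along z)
    void sum_all_axes(const uchar *data, int dim,
                      std::vector<long long> &row_tot,
                      std::vector<long long> &col_tot,
                      std::vector<long long> &stack_tot)
    {
        row_tot.assign((std::size_t)dim * dim, 0);
        col_tot.assign((std::size_t)dim * dim, 0);
        stack_tot.assign((std::size_t)dim * dim, 0);

        for (int z = 0; z < dim; z++)
            for (int y = 0; y < dim; y++) {
                const uchar *row = data + (std::size_t)dim * (y + (std::size_t)dim * z);
                long long *col   = &col_tot[(std::size_t)dim * z];   // adjacent as x varies
                long long *stack = &stack_tot[(std::size_t)dim * y]; // adjacent as x varies
                long long r = 0;                                     // shared row total stays local
                for (int x = 0; x < dim; x++) {
                    uchar v = row[x];
                    r        += v;
                    col[x]   += v;
                    stack[x] += v;
                }
                row_tot[y + (std::size_t)dim * z] = r;
            }
    }

That keeps the working set for each inner run to roughly the 1000 values plus two runs of 1000 totals described above, at the cost of the two extra stores per value.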

So I don't promise anything, but I think it's worth looking at order of memory access, whether you multi-thread or not. If you can do more CPU work while accessing only a relatively small amount of memory, then you'll speed up the single-threaded version but also put yourself in much better shape for multi-threading, since the cores share a limited cache, memory bus, and main RAM.

(*) Back of envelope calculation: in random reviews off the internet the highest estimated FSB bandwidth for Core2 processors I've found so far is an Extreme at 12GB/s (2 channels at 4x199MHz each). Cache line size is 64 bytes, which is less than your stride. So summing a column or stack the bad way, grabbing 64 bytes per value, would only saturate the bus if it was doing 200 million values per second (roughly 12GB/s divided by 64 bytes per line). I'm guessing it's nothing like this fast (10-15 seconds for the whole thing), or you wouldn't be asking how to speed it up.

So my first guess was probably way off. Unless your compiler or CPU has inserted some very clever pre-fetching, a single core cannot be using 2 channels and 4 simultaneous transfers per cycle. For that matter, 4 cores couldn't use 2 channels and 4 simultaneous transfers. The effective bus bandwidth for a series of requests might be much lower than the physical limit, in which case you would hope to see good improvements from multi-threading simply because you have 4 cores asking for 4 different cache lines, all of which can be loaded simultaneously without troubling the FSB or the cache controller. But the latency is still the killer, and so if you can load less than one cache line per value summed, you'll do much better.