Efficient layout and reduction of virtual 2d data (abstract)


First, I was puzzled for a moment why (N**2 - N)/2 yielded 27 for N=7 ... but for indices 0-7, N=8, and there are 28 elements in P. Shouldn't try to answer questions like this so late in the day. :-)

But on to a potential solution: Do you need to keep the array P for any other purpose? If not, I think you can get the result you want with just two intermediate arrays, each of length N: one for the sum of the rows and one for the sum of the columns.

Here's a quick-and-dirty example of what I think you're trying to do (subroutine direct_approach()) and how to achieve the same result using the intermediate arrays (subroutine refined_approach()):

#include <cstdlib>
#include <cstdio>

const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };

float P[N][N];      // Yes, I'm wasting half the array.  This way I don't
                    //  have to fuss with mapping the indices.
float result1[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };
float result2[N] = { 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F, 0.0F };

float f(float arg1, float arg2)
{
    // Arbitrary computation
    return (arg1 * arg2);
}

float compute_result(int index)
{
    float row_sum = 0.0F;
    float col_sum = 0.0F;
    int row;
    int col;

    // Compute the row sum
    for (col = (index + 1); col < N; col++)
    {
        row_sum += P[index][col];
    }

    // Compute the column sum
    for (row = 0; row < index; row++)
    {
        col_sum += P[row][index];
    }

    return (row_sum - col_sum);
}

void direct_approach()
{
    int row;
    int col;

    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            P[row][col] = f(input_values[row], input_values[col]);
        }
    }

    int index;
    for (index = 0; index < N; index++)
    {
        result1[index] = compute_result(index);
    }
}

void refined_approach()
{
    float row_sums[N];
    float col_sums[N];
    int index;

    // Initialize intermediate arrays
    for (index = 0; index < N; index++)
    {
        row_sums[index] = 0.0F;
        col_sums[index] = 0.0F;
    }

    // Compute the row and column sums
    // This can be parallelized by computing row and column sums
    //  independently, instead of in nested loops.
    int row;
    int col;
    for (row = 0; row < N; row++)
    {
        for (col = (row + 1); col < N; col++)
        {
            float computed = f(input_values[row], input_values[col]);
            row_sums[row] += computed;
            col_sums[col] += computed;
        }
    }

    // Compute the result
    for (index = 0; index < N; index++)
    {
        result2[index] = row_sums[index] - col_sums[index];
    }
}

void print_result(int n, float * result)
{
    int index;
    for (index = 0; index < n; index++)
    {
        printf("  [%d]=%f\n", index, result[index]);
    }
}

int main(int argc, char * * argv)
{
    printf("Data reduction test\n");
    direct_approach();
    printf("Result 1:\n");
    print_result(N, result1);
    refined_approach();
    printf("Result 2:\n");
    print_result(N, result2);
    return (0);
}

Parallelizing the computation is not so easy, since each intermediate value is a function of most of the inputs. You can compute the sums individually, but that would mean performing f(...) multiple times. The best suggestion I can think of for very large values of N is to use more intermediate arrays, computing subsets of the results, then summing the partial arrays to yield the final sums. I'd have to think about that one when I'm not so tired.

To cope with the skip issue: If it's a simple matter of "don't use input values x, y, and z", you can store x, y, and z in a do_not_use array and check for those values when looping to compute the sums. If the values to be skipped are some function of row and column, you can store those as pairs and check for the pairs.
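If the skip set stays small, that membership check can be a simple linear scan. A minimal sketch of what I mean (is_skipped and the array names are mine, purely for illustration):

```cpp
// Return true if value_index is listed in do_not_use.
// A linear scan is fine for a handful of bad values; for thousands,
// a sorted list or an indirection array would be faster.
bool is_skipped(int value_index, const int do_not_use[], int n_skipped)
{
    for (int i = 0; i < n_skipped; i++)
    {
        if (do_not_use[i] == value_index)
        {
            return true;
        }
    }
    return false;
}
```

The summing loops would then call is_skipped on each row and column index before touching P.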

Hope this gives you ideas for your solution!

Update, now that I'm awake: Dealing with "skip" depends a lot on what data needs to be skipped. For the first case - "don't use input values x, y, and z" - a much faster solution for large data sets would be to add a level of indirection: create yet another array, this one of integer indices, and store only the indices of the good inputs. F'r instance, if invalid data is in inputs 2 and 5, the valid array would be:

int valid_indices[] = { 0, 1, 3, 4, 6 };

Iterate over the array valid_indices, and use those indices to retrieve the data from your input array to compute the result. On the other paw, if the values to skip depend on both indices of the P array, I don't see how you can avoid some kind of lookup.
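Here's a minimal sketch of that indirection, reusing f and input_values from the example above (sum_valid_pairs is a name I made up):

```cpp
const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };

float f(float arg1, float arg2)
{
    // Same arbitrary computation as in the example above
    return (arg1 * arg2);
}

// Sum f over every pair of *valid* inputs. The dense loop indices a and b
// are mapped back to the sparse input slots through valid_indices, so the
// bad entries are never touched at all.
float sum_valid_pairs(const int valid_indices[], int n_valid)
{
    float total = 0.0F;
    for (int a = 0; a < n_valid; a++)
    {
        for (int b = a + 1; b < n_valid; b++)
        {
            total += f(input_values[valid_indices[a]],
                       input_values[valid_indices[b]]);
        }
    }
    return total;
}
```

The same indirection works for the row/column sum loops; only the index mapping changes.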

Back to parallelizing - No matter what, you'll be dealing with (N**2 - N)/2 computations of f(). One possibility is to just accept that there will be contention for the sum arrays, which would not be a big issue if computing f() takes substantially longer than the two additions. When you get to very large numbers of parallel paths, contention will again be an issue, but there should be a "sweet spot" balancing the number of parallel paths against the time required to compute f().

If contention is still an issue, you can partition the problem several ways. One way is to compute a row or column at a time: for a row at a time, each column sum can be computed independently and a running total can be kept for each row sum.

Another approach would be to divide the data space and, thus, the computation into subsets, where each subset has its own row and column sum arrays. After each block is computed, the independent arrays can then be summed to produce the values you need to compute the result.
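Here's a serial sketch of that partitioned scheme, run as two "workers" over halves of the rows. The names and the two-way split are illustrative, and f/input_values echo the example above; in real code each block would run on its own thread:

```cpp
const int N = 7;
const float input_values[N] = { 3.0F, 5.0F, 7.0F, 11.0F, 13.0F, 17.0F, 23.0F };

float f(float arg1, float arg2)
{
    // Same arbitrary computation as in the example above
    return (arg1 * arg2);
}

// One worker's share: rows [row_begin, row_end), accumulated into that
// worker's *private* partial arrays - no shared accumulator is written.
void process_block(int row_begin, int row_end,
                   float row_sums[], float col_sums[])
{
    for (int row = row_begin; row < row_end; row++)
    {
        for (int col = row + 1; col < N; col++)
        {
            float computed = f(input_values[row], input_values[col]);
            row_sums[row] += computed;
            col_sums[col] += computed;
        }
    }
}

// Run two "workers" (serially here), then merge their partial arrays.
void partitioned_result(float result[])
{
    float rs[2][N] = {};
    float cs[2][N] = {};
    process_block(0, N / 2, rs[0], cs[0]);   // worker 0: rows 0..2
    process_block(N / 2, N, rs[1], cs[1]);   // worker 1: rows 3..6
    for (int i = 0; i < N; i++)
    {
        result[i] = (rs[0][i] + rs[1][i]) - (cs[0][i] + cs[1][i]);
    }
}
```

Because each worker owns its partial arrays, the only synchronization needed is the final merge.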


This probably will be one of those naive and useless answers, but it also might help. Feel free to tell me that I'm utterly and completely wrong and I have misunderstood the whole affair.

So... here we go!

The Basic Problem

It seems to me that you can define your result function a little differently, which will lift at least some contention off your intermediate values. Let's suppose that your P matrix is lower-triangular. If you (virtually) fill the upper triangle with the negative of the lower values (and the main diagonal with all zeros), then you can redefine each element of your result as the sum of a single row (shown here for N=4, where -i means the negative of the value in the cell marked as i):

 P     0    1    2    3
  |--------------------
 0|    x   -0   -1   -3
 1|    0    x   -2   -4
 2|    1    2    x   -5
 3|    3    4    5    x

If you launch independent threads (executing the same kernel) to calculate the sum of each row of this matrix, each thread will write a single result element. It seems that your problem size is large enough to saturate your hardware threads and keep them busy.

The caveat, of course, is that you'll be calculating each f(x, y) twice. I don't know how expensive that is, or how much the memory contention was costing you before, so I cannot judge whether this is a worthwhile trade-off. But unless f is really, really expensive, I think it might be.
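To make the idea concrete, here's a serial sketch of the work one thread would do (skips ignored for now; f is a stand-in product, and row_result is just my name for it):

```cpp
double f(double arg1, double arg2)
{
    // Stand-in for the real computation
    return arg1 * arg2;
}

// Sum row my_id of the virtual antisymmetric matrix: lower-triangle
// entries are added as stored, upper-triangle entries are the negated
// mirror values. No two rows write the same output element, so N of
// these can run in parallel without contention.
double row_result(unsigned my_id, unsigned n, const double values[])
{
    double res = 0.0;
    for (unsigned col = 0; col < my_id; ++col)
        res += f(values[col], values[my_id]);   // lower triangle, as stored
    for (unsigned row = my_id + 1; row < n; ++row)
        res -= f(values[my_id], values[row]);   // upper triangle, negated
    return res;
}
```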

Skipping Values

You mention that you might have tens of thousands of elements of the P matrix that you need to ignore in your calculations (effectively skip them.)

To work with the scheme I've proposed above, I believe you should store the skipped elements as (row, col) pairs, and you have to add the transpose of each coordinate pair too (so you'll have twice the number of skipped values.) So your example skip list of P[6], P[14] and P[18] becomes P(4,0), P(5,4), P(6,3) which then becomes P(4,0), P(5,4), P(6,3), P(0,4), P(4,5), P(3,6).

Then you sort this list, first based on row and then column. This makes our list to be P(0,4), P(3,6), P(4,0), P(4,5), P(5,4), P(6,3).

If each row of your virtual P matrix is processed by one thread (or a single instance of your kernel or whatever,) you can pass it the values it needs to skip. Personally, I would store all these in a big 1D array and just pass in the first and last index that each thread would need to look at (I would also not store the row indices in the final array that I passed in, since it can be implicitly inferred, but I think that's obvious.) In the example above, for N = 8, the begin and end pairs passed to each thread will be: (note that the end is one past the final value needed to be processed, just like STL, so an empty list is denoted by begin == end)

Thread 0: 0..1
Thread 1: 1..1 (or 0..0 or whatever)
Thread 2: 1..1
Thread 3: 1..2
Thread 4: 2..4
Thread 5: 4..5
Thread 6: 5..6
Thread 7: 6..6
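The mirror-sort-slice preparation above can be sketched on the host like this (a sketch assuming C++; SkipTable and build_skip_table are hypothetical names):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct SkipTable
{
    std::vector<unsigned> columns;                       // column indices only
    std::vector<std::pair<unsigned, unsigned> > ranges;  // per-row [begin, end)
};

// Mirror each (row, col) pair across the diagonal, sort by (row, col),
// keep only the column numbers, and record each row's slice of the
// flattened array. begin == end means "no skips for this row".
SkipTable build_skip_table(std::vector<std::pair<unsigned, unsigned> > skips,
                           unsigned n)
{
    const std::size_t original = skips.size();
    for (std::size_t i = 0; i < original; ++i)
        skips.push_back(std::make_pair(skips[i].second, skips[i].first));

    // std::pair's default ordering sorts by row first, then column
    std::sort(skips.begin(), skips.end());

    SkipTable table;
    for (std::size_t i = 0; i < skips.size(); ++i)
        table.columns.push_back(skips[i].second);

    unsigned pos = 0;
    for (unsigned row = 0; row < n; ++row)
    {
        const unsigned begin = pos;
        while (pos < skips.size() && skips[pos].first == row)
            ++pos;
        table.ranges.push_back(std::make_pair(begin, pos));
    }
    return table;
}
```

Each thread then receives table.columns plus its own ranges entry as skip_begin/skip_end.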

Now, each thread goes on to calculate and sum all the intermediate values in a row. While it is stepping through the indices of columns, it is also stepping through this list of skipped values and skipping any column number that comes up in the list. This is obviously an efficient and simple operation (since the list is sorted by column too. It's like merging.)

Pseudo-Implementation

I don't know CUDA, but I have some experience working with OpenCL, and I imagine the interfaces are similar (since the hardware they are targeting is the same.) Here's an implementation of the kernel that does the processing for a row (i.e. calculates one entry of result) in pseudo-C++:

double calc_one_result (
    unsigned my_id, unsigned N, double const init_values [],
    unsigned skip_indices [], unsigned skip_begin, unsigned skip_end)
{
    double res = 0;

    for (unsigned col = 0; col < my_id; ++col)
        // "f" seems to take init_values[column] as its first arg
        res += f (init_values[col], init_values[my_id]);

    for (unsigned row = my_id + 1; row < N; ++row)
        res -= f (init_values[my_id], init_values[row]);

    // At this point, "res" is holding "result[my_id]",
    // including the values that should have been skipped

    unsigned i = skip_begin;

    // The second condition is to check whether we have reached the
    // middle of the virtual matrix or not
    for (; i < skip_end && skip_indices[i] < my_id; ++i)
    {
        unsigned col = skip_indices[i];
        res -= f (init_values[col], init_values[my_id]);
    }

    for (; i < skip_end; ++i)
    {
        unsigned row = skip_indices[i];
        res += f (init_values[my_id], init_values[row]);
    }

    return res;
}

Note the following:

  1. The semantics of init_values and function f are as described by the question.

  2. This function calculates one entry in the result array; specifically, it calculates result[my_id], so you should launch N instances of this.

  3. The only shared variable it writes to is result[my_id]. Well, the above function doesn't write to anything, but if you translate it to CUDA, I imagine you'd have to write to that at the end. However, no one else writes to that particular element of result, so this write will not cause any contention or data race.

  4. The two input arrays, init_values and skip_indices, are shared among all the running instances of this function.

  5. All accesses to data are linear and sequential, except for the skipped values, which I believe is unavoidable.

  6. skip_indices contains a list of indices that should be skipped in each row. Its contents and structure are as described above, with one small optimization. Since there was no need, I have removed the row numbers and left only the columns. The row number is passed into the function as my_id anyways, and the slice of the skip_indices array that should be used by each invocation is determined using skip_begin and skip_end.

    For the example above, the array that is passed into all invocations of calc_one_result will look like this: [4, 6, 0, 5, 4, 3].

  7. As you can see, apart from the loops, the only conditional branch in this code is skip_indices[i] < my_id in the third for-loop. Although I believe this is innocuous and totally predictable, even this branch can be easily avoided in the code. We just need to pass in another parameter called skip_middle that tells us where the skipped items cross the main diagonal (i.e. for row #my_id, the element at skip_indices[skip_middle] is the first that is larger than my_id.)
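Here's a sketch of that branch-free variant (f is a stand-in product; calc_one_result_nobranch and the placement of the skip_middle parameter are my assumptions, not established API):

```cpp
double f(double arg1, double arg2)
{
    // Stand-in for the real computation
    return arg1 * arg2;
}

// Same as calc_one_result, but the caller precomputes skip_middle per row,
// so neither skip loop needs the skip_indices[i] < my_id comparison:
// [skip_begin, skip_middle) are columns below the diagonal,
// [skip_middle, skip_end) are rows above it.
double calc_one_result_nobranch (
    unsigned my_id, unsigned n, double const values [],
    unsigned const skip_indices [],
    unsigned skip_begin, unsigned skip_middle, unsigned skip_end)
{
    double res = 0.0;

    for (unsigned col = 0; col < my_id; ++col)
        res += f (values[col], values[my_id]);

    for (unsigned row = my_id + 1; row < n; ++row)
        res -= f (values[my_id], values[row]);

    // Undo the terms that should have been skipped
    for (unsigned i = skip_begin; i < skip_middle; ++i)
        res -= f (values[skip_indices[i]], values[my_id]);

    for (unsigned i = skip_middle; i < skip_end; ++i)
        res += f (values[my_id], values[skip_indices[i]]);

    return res;
}
```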

In Conclusion

I'm by no means an expert in CUDA and HPC. But if I have understood your problem correctly, I think this method might eliminate any and all contentions for memory. Also, I don't think this will cause any (more) numerical stability issues.

The cost of implementing this is:

  • Calling f twice as many times in total (and keeping track of when it is called for row < col so you can multiply the result by -1.)
  • Storing twice as many items in the list of skipped values. Since the size of this list is in the thousands (and not billions!) it shouldn't be much of a problem.
  • Sorting the list of skipped values; which again due to its size, should be no problem.

(UPDATE: Added the Pseudo-Implementation section.)