Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?


A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value for the other ones (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value).

When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search for the index of interest left-shifted by 8 (O(log(n)) with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.

    std::vector<uint8_t> main_arr;
    std::vector<uint32_t> sec_arr;

    uint8_t lookup(unsigned idx) {
        // extract the 2 bits of our interest from the main array
        uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
        // usual (likely) case: value between 0 and 2
        if(v != 3) return v;
        // bad case: lookup the index<<8 in the secondary array
        // lower_bound finds the first >=, so we don't need to mask out the value
        auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
    #ifdef _DEBUG
        // some coherency checks
        if(ptr == sec_arr.end()) std::abort();
        if((*ptr >> 8) != idx) std::abort();
    #endif
        // extract our 8-bit value from the 32 bit (index, value) thingie
        return (*ptr) & 0xff;
    }

    void populate(uint8_t *source, size_t size) {
        main_arr.clear(); sec_arr.clear();
        // size the main storage (round up)
        main_arr.resize((size+3)/4);
        for(size_t idx = 0; idx < size; ++idx) {
            uint8_t in = source[idx];
            uint8_t &target = main_arr[idx>>2];
            // if the input doesn't fit, cap to 3 and put in secondary storage
            if(in >= 3) {
                // top 24 bits: index; low 8 bit: value
                sec_arr.push_back((idx << 8) | in);
                in = 3;
            }
            // store in the target according to the position
            target |= in << ((idx & 3)*2);
        }
    }

For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).
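The arithmetic above can be checked with a small sketch. The function names and the per-mille parameter are hypothetical, introduced here only for illustration; the 1% share of "other" values is the figure assumed in the estimate above.

```cpp
#include <cassert>
#include <stddef.h>

// Bytes used by the 2-bits-per-value main array (rounded up).
size_t packed_bytes(size_t n) {
    return (n + 3) / 4;
}

// Bytes used by the 4-byte-per-entry secondary array, assuming that
// rare_per_mille elements out of every 1000 do not fit in 0..2.
size_t secondary_bytes(size_t n, size_t rare_per_mille) {
    assert(rare_per_mille <= 1000);
    return n * rare_per_mille / 1000 * 4;
}

// Total footprint of the two-level layout.
size_t total_bytes(size_t n, size_t rare_per_mille) {
    return packed_bytes(n) + secondary_bytes(n, rare_per_mille);
}
```

With n = 10000000 and 10 per mille (1%) of rare values, this gives 2500000 + 400000 = 2900000 bytes, matching the estimate.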

If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
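One way the extension to 32-bit indices might look is sketched below. This is only an illustration under stated assumptions: the `WideSecondary` name and its methods are hypothetical, and a `std::array` of 256 vectors stands in for the "256 element pointer array"; each bucket holds the same packed (24-bit index, 8-bit value) entries as before.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <stdint.h>
#include <vector>

struct WideSecondary {
    // One sorted array per value of the top 8 bits of the index.
    std::array<std::vector<uint32_t>, 256> buckets;

    // Entries must be appended in increasing index order, so that
    // every bucket stays sorted (this is what populate() would do).
    void append(uint32_t idx, uint8_t value) {
        std::vector<uint32_t> &b = buckets[idx >> 24];
        uint32_t packed = ((idx & 0xFFFFFFu) << 8) | value;
        assert(b.empty() || (b.back() >> 8) < (packed >> 8));
        b.push_back(packed);
    }

    // Returns true and stores the value if idx is present.
    bool find(uint32_t idx, uint8_t &value) const {
        const std::vector<uint32_t> &b = buckets[idx >> 24];
        uint32_t low = idx & 0xFFFFFFu;
        // lower_bound finds the first entry >= (low << 8)
        auto it = std::lower_bound(b.begin(), b.end(), low << 8);
        if (it == b.end() || (*it >> 8) != low) return false;
        value = uint8_t(*it & 0xFF);
        return true;
    }
};
```

The binary search then runs only over the entries that share the top 8 bits, so each bucket is at most as large as in the 24-bit case.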


Quick benchmark

    #include <algorithm>
    #include <vector>
    #include <stdint.h>
    #include <chrono>
    #include <stdio.h>
    #include <math.h>

    using namespace std::chrono;

    /// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
    /// than LCG but fail some test suites
    struct XorShift32 {
        /// This stuff allows to use this class wherever a library function
        /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
        typedef uint32_t result_type;
        static uint32_t min() { return 1; }
        static uint32_t max() { return uint32_t(-1); }

        /// PRNG state
        uint32_t y;

        /// Initializes with seed
        XorShift32(uint32_t seed = 0) : y(seed) {
            if(y == 0) y = 2463534242UL;
        }

        /// Returns a value in the range [1, 1<<32)
        uint32_t operator()() {
            y ^= (y<<13);
            y ^= (y>>17);
            y ^= (y<<15);
            return y;
        }

        /// Returns a value in the range [0, limit); this conforms to the RandomFunc
        /// requirements for std::random_shuffle
        uint32_t operator()(uint32_t limit) {
            return (*this)()%limit;
        }
    };

    struct mean_variance {
        double rmean = 0.;
        double rvariance = 0.;
        int count = 0;

        void operator()(double x) {
            ++count;
            double ormean = rmean;
            rmean     += (x-rmean)/count;
            rvariance += (x-ormean)*(x-rmean);
        }

        double mean()     const { return rmean; }
        double variance() const { return rvariance/(count-1); }
        double stddev()   const { return std::sqrt(variance()); }
    };

    std::vector<uint8_t> main_arr;
    std::vector<uint32_t> sec_arr;

    uint8_t lookup(unsigned idx) {
        // extract the 2 bits of our interest from the main array
        uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
        // usual (likely) case: value between 0 and 2
        if(v != 3) return v;
        // bad case: lookup the index<<8 in the secondary array
        // lower_bound finds the first >=, so we don't need to mask out the value
        auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
    #ifdef _DEBUG
        // some coherency checks
        if(ptr == sec_arr.end()) std::abort();
        if((*ptr >> 8) != idx) std::abort();
    #endif
        // extract our 8-bit value from the 32 bit (index, value) thingie
        return (*ptr) & 0xff;
    }

    void populate(uint8_t *source, size_t size) {
        main_arr.clear(); sec_arr.clear();
        // size the main storage (round up)
        main_arr.resize((size+3)/4);
        for(size_t idx = 0; idx < size; ++idx) {
            uint8_t in = source[idx];
            uint8_t &target = main_arr[idx>>2];
            // if the input doesn't fit, cap to 3 and put in secondary storage
            if(in >= 3) {
                // top 24 bits: index; low 8 bit: value
                sec_arr.push_back((idx << 8) | in);
                in = 3;
            }
            // store in the target according to the position
            target |= in << ((idx & 3)*2);
        }
    }

    volatile unsigned out;

    int main() {
        XorShift32 xs;
        std::vector<uint8_t> vec;
        int size = 10000000;
        for(int i = 0; i<size; ++i) {
            uint32_t v = xs();
            if(v < 1825361101)      v = 0; // 42.5%
            else if(v < 4080218931) v = 1; // 95.0%
            else if(v < 4252017623) v = 2; // 99.0%
            else {
                while((v & 0xff) < 3) v = xs();
            }
            vec.push_back(v);
        }
        populate(vec.data(), vec.size());
        mean_variance lk_t, arr_t;
        for(int i = 0; i<50; ++i) {
            {
                unsigned o = 0;
                auto beg = high_resolution_clock::now();
                for(int i = 0; i < size; ++i) {
                    o += lookup(xs() % size);
                }
                out += o;
                int dur = (high_resolution_clock::now()-beg)/microseconds(1);
                fprintf(stderr, "lookup: %10d µs\n", dur);
                lk_t(dur);
            }
            {
                unsigned o = 0;
                auto beg = high_resolution_clock::now();
                for(int i = 0; i < size; ++i) {
                    o += vec[xs() % size];
                }
                out += o;
                int dur = (high_resolution_clock::now()-beg)/microseconds(1);
                fprintf(stderr, "array:  %10d µs\n", dur);
                arr_t(dur);
            }
        }

        fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
        printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
                lk_t.mean(), lk_t.stddev(),
                arr_t.mean(), arr_t.stddev(),
                arr_t.mean()/lk_t.mean());
        return 0;
    }

(code and data always updated in my Bitbucket)

The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:

  • performs a random lookup of 10M elements with my data structure
  • does the same through the original array.

(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)

These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (array_mean/lookup_mean, so values above 1 mean my data structure is faster).

I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on some machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer Linux. I don't think the OS should be relevant at all in this case.

                CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
    Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
    Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
    Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
    Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
    Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
    Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
    Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
    Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
    Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
    Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
    Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
    Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
    Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

The results are... mixed!

  1. In general, on most of these machines there is some kind of speedup, or at least they are on a par.
  2. The two cases where the array truly trumps the "smart structure" lookup are on machines with lots of cache and not particularly busy: the Xeon E5-1650 above (15 MB cache) is a night build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high performance calculations, in an idle moment as well. It does make sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
  3. At the opposite side of the "performance spectrum", but where again the array is slightly faster, there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with a small enough cache perform similarly.
  4. The Xeon X5650 must be taken with some caution - they are virtual machines on a quite busy dual-socket virtual machine server; it may well be that, although nominally it has a decent amount of cache, during the time of the test it gets preempted by completely unrelated virtual machines several times.


Another option could be

  • check if the result is 0, 1 or 2
  • if not do a regular lookup

In other words something like:

    unsigned char lookup(int index) {
        int code = (bmap[index>>2]>>(2*(index&3)))&3;
        if (code != 3) return code;
        return full_array[index];
    }

where bmap uses 2 bits per element with the value 3 meaning "other".

This structure is trivial to update, uses 25% more memory but the big part is looked up only in 5% of the cases. Of course, as usual, if it's a good idea or not depends on a lot of other conditions so the only answer is experimenting with real usage.
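To illustrate why updates are trivial with this layout, here is a minimal sketch that wraps `bmap` and `full_array` in a struct and adds a `set` operation. The struct name and the `set` method are assumptions made for the example; only `bmap`, `full_array` and the lookup logic come from the snippet above.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

struct BitmapFrontedArray {
    std::vector<uint8_t> bmap;        // 2 bits per element; 3 means "other"
    std::vector<uint8_t> full_array;  // original values, one byte each

    explicit BitmapFrontedArray(size_t size)
        : bmap((size + 3) / 4), full_array(size) {}

    // Updating touches one byte of each array; no sorting or reallocation.
    void set(size_t index, uint8_t value) {
        assert(index < full_array.size());
        full_array[index] = value;
        unsigned code = value < 3 ? value : 3u;
        unsigned shift = 2 * (index & 3);
        uint8_t &cell = bmap[index >> 2];
        // clear the old 2-bit code, then store the new one
        cell = uint8_t((cell & ~(3u << shift)) | (code << shift));
    }

    uint8_t lookup(size_t index) const {
        uint8_t code = (bmap[index >> 2] >> (2 * (index & 3))) & 3;
        if (code != 3) return code;
        return full_array[index];
    }
};
```

Compared with a sorted secondary array, overwriting a common value with a rare one (or the reverse) costs the same as any other write.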


This is more of a "long comment" than a concrete answer

Unless your data follows some well-known pattern, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".

For patterns that are not well known, like what I think yours is, nobody will KNOW directly which is better, and it depends on the details: how random is the random access? Is the system accessing clusters of data items, or is it completely random, like from a uniform random number generator? Is the table data completely random, or are there sequences of 0 then sequences of 1, with a scattering of other values? Run length encoding would work well if you have reasonably long sequences of 0 and 1, but won't work if you have a "checkerboard" of 0/1. Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly.
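As a rough illustration of run length encoding with random access, the sketch below stores the first index of every run in a sorted array (playing the role of the "starting points" table) and finds the run containing an index by binary search. The `RunLengthArray` name and its layout are assumptions for this example, and indices are limited to 32 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

struct RunLengthArray {
    std::vector<uint32_t> starts;  // first index of each run, increasing
    std::vector<uint8_t>  values;  // value repeated throughout each run

    void build(const std::vector<uint8_t> &src) {
        starts.clear(); values.clear();
        for (size_t i = 0; i < src.size(); ++i) {
            // open a new run whenever the value changes
            if (values.empty() || values.back() != src[i]) {
                starts.push_back(uint32_t(i));
                values.push_back(src[i]);
            }
        }
    }

    // O(log(number of runs)); idx must be below the original size.
    uint8_t lookup(uint32_t idx) const {
        assert(!starts.empty());
        // upper_bound finds the first run starting after idx;
        // the run just before it is the one containing idx
        auto it = std::upper_bound(starts.begin(), starts.end(), idx);
        return values[size_t(it - starts.begin()) - 1];
    }
};
```

With a "checkerboard" input every element opens its own run, so this uses 5 bytes per element instead of 1, which is the failure mode described above.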

I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as one recently calling someone that there is no pre-loaded data of any kind; it's just purely random. Big page-tables are the best optimisation for that type of access.

In a lot of cases, "speed versus small size" is one of those trade-offs you have to make in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have a "better" use for the RAM, then optimising for the size of the table would give you sufficient performance and a good improvement in size. There are lots of different ways you could achieve this - as suggested in a comment, a 2 bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash-table would be my first approach, but a list or binary tree may work too. Again, it depends on how your "not 0, 1 or 2" values are "scattered" in the table - are they in clusters or are they more of an evenly distributed pattern?

But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".
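The 2 bit field plus hash-table idea could be sketched roughly as follows. The `HashedSecondary` name and its members are hypothetical, and `std::unordered_map` is just one possible choice of hash table; it carries noticeable per-entry overhead, so the memory saving would need measuring.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <unordered_map>
#include <vector>

struct HashedSecondary {
    std::vector<uint8_t> main_arr;                // 2 bits per element
    std::unordered_map<uint32_t, uint8_t> rare;   // index -> value, for values >= 3

    void populate(const uint8_t *source, size_t size) {
        main_arr.assign((size + 3) / 4, 0);
        rare.clear();
        for (size_t idx = 0; idx < size; ++idx) {
            uint8_t in = source[idx];
            if (in >= 3) {
                // does not fit in 2 bits: keep the real value in the hash table
                rare[uint32_t(idx)] = in;
                in = 3;
            }
            main_arr[idx >> 2] |= uint8_t(in << ((idx & 3) * 2));
        }
    }

    uint8_t lookup(uint32_t idx) const {
        uint8_t v = (main_arr[idx >> 2] >> (2 * (idx & 3))) & 3;
        if (v != 3) return v;      // common case, no hashing needed
        auto it = rare.find(idx);  // rare case: expected O(1) hash lookup
        assert(it != rare.end());
        return it->second;
    }
};
```

The common path is the same as with a sorted secondary array; only the rare path changes, and it still goes out to RAM as noted above.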

The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.

In the end, you will probably have to implement one or a few of the ideas in comments/answers to test them out, and see if they help solve your problem or if the memory bus is still the main limiting factor.