
How to speed up multiple inner products in python


This new code gets another order of magnitude speedup by taking advantage of the cyclic symmetry of the problem. This Python version enumerates necklaces with Duval's algorithm; the C version uses brute force. Both incorporate the speedups described below. On my machine, the C version solves n = 20 in 100 seconds! A back-of-the-envelope calculation suggests that, if you were to let it run for a week on a single core, it could do n = 26, and, as noted below, it's amenable to parallelism.

import itertools

def necklaces_with_multiplicity(n):
    assert isinstance(n, int)
    assert n > 0
    w = [1] * n
    i = 1
    while True:
        if n % i == 0:
            s = sum(w)
            if s > 0:
                yield (tuple(w), i * 2)
            elif s == 0:
                yield (tuple(w), i)
        i = n - 1
        while w[i] == -1:
            if i == 0:
                return
            i -= 1
        w[i] = -1
        i += 1
        for j in range(n - i):
            w[i + j] = w[j]

def leading_zero_counts(n):
    assert isinstance(n, int)
    assert n > 0
    assert n % 2 == 0
    counts = [0] * n
    necklaces = list(necklaces_with_multiplicity(n))
    for combo in itertools.combinations(range(n - 1), n // 2):
        for v, multiplicity in necklaces:
            w = list(v)
            for j in combo:
                w[j] *= -1
            for i in range(n):
                counts[i] += multiplicity * 2
                product = 0
                for j in range(n):
                    product += v[j - (i + 1)] * w[j]
                if product != 0:
                    break
    return counts

if __name__ == '__main__':
    print(leading_zero_counts(12))

C version:

#include <stdio.h>

enum {
  N = 14
};

struct Necklace {
  unsigned int v;
  int multiplicity;
};

static struct Necklace g_necklace[1 << (N - 1)];
static int g_necklace_count;

static void initialize_necklace(void) {
  g_necklace_count = 0;
  for (unsigned int v = 0; v < (1U << (N - 1)); v++) {
    int multiplicity;
    unsigned int w = v;
    for (multiplicity = 2; multiplicity < 2 * N; multiplicity += 2) {
      w = ((w & 1) << (N - 1)) | (w >> 1);
      unsigned int x = w ^ ((1U << N) - 1);
      if (w < v || x < v) goto nope;
      if (w == v || x == v) break;
    }
    g_necklace[g_necklace_count].v = v;
    g_necklace[g_necklace_count].multiplicity = multiplicity;
    g_necklace_count++;
   nope:
    ;
  }
}

int main(void) {
  initialize_necklace();
  long long leading_zero_count[N + 1];
  for (int i = 0; i < N + 1; i++) leading_zero_count[i] = 0;
  for (unsigned int v_xor_w = 0; v_xor_w < (1U << (N - 1)); v_xor_w++) {
    if (__builtin_popcount(v_xor_w) != N / 2) continue;
    for (int k = 0; k < g_necklace_count; k++) {
      unsigned int v = g_necklace[k].v;
      unsigned int w = v ^ v_xor_w;
      for (int i = 0; i < N + 1; i++) {
        leading_zero_count[i] += g_necklace[k].multiplicity;
        w = ((w & 1) << (N - 1)) | (w >> 1);
        if (__builtin_popcount(v ^ w) != N / 2) break;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) {
    printf(" %lld", 2 * leading_zero_count[i]);
  }
  putchar('\n');
  return 0;
}

You can get a bit of speedup by exploiting the sign symmetry (4x) and by iterating over only those vectors that pass the first inner product test (asymptotically, O(sqrt(n))x).

import itertools

n = 10
m = n + 1

def innerproduct(A, B):
    s = 0
    for k in range(n):
        s += A[k] * B[k]
    return s

leadingzerocounts = [0] * m
for S in itertools.product([-1, 1], repeat=n - 1):
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if innerproduct(F, S1S1[i:i + n]):
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

C version, to get a feel for how much performance we're losing to PyPy (n = 16 under PyPy takes roughly as long as n = 18 in C):

#include <stdio.h>

enum {
  HALFN = 9,
  N = 2 * HALFN
};

int main(void) {
  long long lzc[N + 1];
  for (int i = 0; i < N + 1; i++) lzc[i] = 0;
  unsigned int xor = 1 << (N - 1);
  while (xor-- > 0) {
    if (__builtin_popcount(xor) != HALFN) continue;
    unsigned int s = 1 << (N - 1);
    while (s-- > 0) {
      lzc[0]++;
      unsigned int f = xor ^ s;
      for (int i = 1; i < N + 1; i++) {
        f = ((f & 1) << (N - 1)) | (f >> 1);
        if (__builtin_popcount(f ^ s) != HALFN) break;
        lzc[i]++;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) printf(" %lld", 4 * lzc[i]);
  putchar('\n');
  return 0;
}

This algorithm is embarrassingly parallel because it's just accumulating over all values of xor. With the C version, a back-of-the-envelope calculation suggests that a few thousand hours of CPU time would suffice to calculate n = 26, which works out to a couple hundred dollars at current rates on EC2. There are undoubtedly some optimizations to be made (e.g., vectorization), but for a one-off like this I'm not sure how much more programmer effort is worthwhile.
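As a concrete illustration of that parallelism, here is a minimal Python sketch (names and chunking scheme are my own, not from the answer above) mirroring the C loop: the xor space is split into independent chunks, each chunk produces a partial count vector, and the partials are summed at the end. The chunks are processed serially here for clarity; handing them to a `multiprocessing.Pool` would distribute the same work across cores.

```python
# Hypothetical sketch: partition the accumulation over xor values.
# Mirrors the C inner loops above, with a tiny N so it runs quickly.
HALFN = 3
N = 2 * HALFN

def partial_counts(xor_lo, xor_hi):
    # Partial leading-zero counts for xor values in [xor_lo, xor_hi).
    # Each xor value contributes independently, so partials are additive.
    lzc = [0] * (N + 1)
    for xor in range(xor_lo, xor_hi):
        if bin(xor).count('1') != HALFN:
            continue
        for s in range(1 << (N - 1)):
            lzc[0] += 1
            f = xor ^ s
            for i in range(1, N + 1):
                # Rotate f right by one bit within N bits.
                f = ((f & 1) << (N - 1)) | (f >> 1)
                if bin(f ^ s).count('1') != HALFN:
                    break
                lzc[i] += 1
    return lzc

# Split the xor range into independent tasks and sum the partial results.
step = 1 << (N - 2)
chunks = [(k, k + step) for k in range(0, 1 << (N - 1), step)]
totals = [0] * (N + 1)
for lo, hi in chunks:
    part = partial_counts(lo, hi)
    totals = [a + b for a, b in zip(totals, part)]
print([4 * c for c in totals])
```

Because each chunk touches only its own accumulator, no synchronization is needed until the final reduction.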


One very simple speed up of a factor of n is to change this code:

def innerproduct(A, B):
    assert (len(A) == len(B))
    for j in xrange(len(A)):
        s = 0
        for k in xrange(0, n):
            s += A[k] * B[k]
    return s

to

def innerproduct(A, B):
    assert (len(A) == len(B))
    s = 0
    for k in xrange(0, n):
        s += A[k] * B[k]
    return s

(I don't know why you have the loop over j; it just repeats the same calculation each time, so it's unnecessary.)


I tried to speed this up and I failed badly :( But I'm posting the code anyway; it's somewhat faster, but not fast enough for values like n = 24.

My assumptions

Your lists consist of values from {-1, 1}, so I decided to use integers instead of lists - each bit represents one of the possible values: a set bit means 1, a zeroed bit means -1. The only possible results of multiplying values from {-1, 1} are 1 and -1, so I used bitwise XOR instead of multiplication. I also noticed that there's a symmetry, so you only need to check a subset (one fourth) of the possible lists and multiply the result by 4 (David explained this in his answer).

Finally, I put the results of all possible operations into lookup tables to eliminate the need to compute them. This takes a lot of memory, but who cares (for n = 24 it was about 150 MB)?
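The identity behind this encoding can be checked directly: with bit = 1 meaning +1 and bit = 0 meaning -1, two entries multiply to +1 exactly when their bits agree, so the inner product of two length-n vectors encoded as integers a and b is n - 2 * popcount(a XOR b). A table indexed by the XOR value then replaces the per-element multiply-and-sum entirely. A small verification sketch (helper names are my own):

```python
import random

n = 8

def to_vector(bits):
    # Decode an n-bit integer into a list of {-1, 1} values.
    return [1 if (bits >> k) & 1 else -1 for k in range(n)]

def inner_by_bits(a, b):
    # Inner product via the bit identity: n - 2 * popcount(a ^ b).
    return n - 2 * bin(a ^ b).count('1')

random.seed(0)
for _ in range(100):
    a = random.randrange(2 ** n)
    b = random.randrange(2 ** n)
    elementwise = sum(x * y for x, y in zip(to_vector(a), to_vector(b)))
    assert elementwise == inner_by_bits(a, b)
print("bitwise inner product matches elementwise product")
```

The `tab` list in the code below is exactly this quantity precomputed for every possible XOR value, so an inner product test becomes a single table lookup.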

And then @David Eisenstat answered the question :) So I took his code and modified it to be bit-based. It's about 2-3 times faster (for n = 16 it took ~30 seconds, compared to ~90 for David's solution), but I think it's still not enough to get results for n = 26 or so.

import itertools

n = 16
m = n + 1
mask = (2 ** n) - 1

# Create table of sum results (replaces innerproduct())
tab = []
for a in range(2 ** n):
    s = 0
    for k in range(n):
        s += -1 if a & 1 else 1
        a >>= 1
    tab.append(s)

# Create combination bit masks for combinations
comb = []
for C in itertools.combinations(range(n - 1), n // 2):
    xor = 0
    for i in C:
        xor |= (1 << i)
    comb.append(xor)

leadingzerocounts = [0] * m
for S in xrange(2 ** (n - 1)):
    S1 = S + (1 << (n - 1))
    S1S1 = S1 + (S1 << n)
    for xor in comb:
        F = S1 ^ xor
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if tab[F ^ ((S1S1 >> i) & mask)]:
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

Conclusions

I thought I had invented something brilliant and hoped that all this messing with bits would give a great speed boost, but the boost was disappointingly small :(

I think the reason is the way Python handles operators - it calls a function for every arithmetic (or logical) operation, even when it could be done with a single assembler instruction (I hoped PyPy would be able to simplify operations to that level, but it didn't). So if C (or ASM) were used with this bit-twiddling solution, it would probably perform great (maybe you could get to n = 24).