
How to speed up multiple inner products in python


This new code gets another order of magnitude speedup by taking advantage of the cyclic symmetry of the problem. This Python version enumerates necklaces with Duval's algorithm; the C version uses brute force. Both incorporate the speedups described below. On my machine, the C version solves n = 20 in 100 seconds! A back-of-the-envelope calculation suggests that, if you were to let it run for a week on a single core, it could do n = 26, and, as noted below, it's amenable to parallelism.

import itertools

def necklaces_with_multiplicity(n):
    assert isinstance(n, int)
    assert n > 0
    w = [1] * n
    i = 1
    while True:
        if n % i == 0:
            s = sum(w)
            if s > 0:
                yield (tuple(w), i * 2)
            elif s == 0:
                yield (tuple(w), i)
        i = n - 1
        while w[i] == -1:
            if i == 0:
                return
            i -= 1
        w[i] = -1
        i += 1
        for j in range(n - i):
            w[i + j] = w[j]

def leading_zero_counts(n):
    assert isinstance(n, int)
    assert n > 0
    assert n % 2 == 0
    counts = [0] * n
    necklaces = list(necklaces_with_multiplicity(n))
    for combo in itertools.combinations(range(n - 1), n // 2):
        for v, multiplicity in necklaces:
            w = list(v)
            for j in combo:
                w[j] *= -1
            for i in range(n):
                counts[i] += multiplicity * 2
                product = 0
                for j in range(n):
                    product += v[j - (i + 1)] * w[j]
                if product != 0:
                    break
    return counts

if __name__ == '__main__':
    print(leading_zero_counts(12))

C version:

#include <stdio.h>

enum {
  N = 14
};

struct Necklace {
  unsigned int v;
  int multiplicity;
};

static struct Necklace g_necklace[1 << (N - 1)];
static int g_necklace_count;

static void initialize_necklace(void) {
  g_necklace_count = 0;
  for (unsigned int v = 0; v < (1U << (N - 1)); v++) {
    int multiplicity;
    unsigned int w = v;
    for (multiplicity = 2; multiplicity < 2 * N; multiplicity += 2) {
      w = ((w & 1) << (N - 1)) | (w >> 1);
      unsigned int x = w ^ ((1U << N) - 1);
      if (w < v || x < v) goto nope;
      if (w == v || x == v) break;
    }
    g_necklace[g_necklace_count].v = v;
    g_necklace[g_necklace_count].multiplicity = multiplicity;
    g_necklace_count++;
   nope:
    ;
  }
}

int main(void) {
  initialize_necklace();
  long long leading_zero_count[N + 1];
  for (int i = 0; i < N + 1; i++) leading_zero_count[i] = 0;
  for (unsigned int v_xor_w = 0; v_xor_w < (1U << (N - 1)); v_xor_w++) {
    if (__builtin_popcount(v_xor_w) != N / 2) continue;
    for (int k = 0; k < g_necklace_count; k++) {
      unsigned int v = g_necklace[k].v;
      unsigned int w = v ^ v_xor_w;
      for (int i = 0; i < N + 1; i++) {
        leading_zero_count[i] += g_necklace[k].multiplicity;
        w = ((w & 1) << (N - 1)) | (w >> 1);
        if (__builtin_popcount(v ^ w) != N / 2) break;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) {
    printf(" %lld", 2 * leading_zero_count[i]);
  }
  putchar('\n');
  return 0;
}

You can get a bit of speedup by exploiting the sign symmetry (4x) and by iterating over only those vectors that pass the first inner product test (asymptotically, O(sqrt(n))x).

import itertools

n = 10
m = n + 1

def innerproduct(A, B):
    s = 0
    for k in range(n):
        s += A[k] * B[k]
    return s

leadingzerocounts = [0] * m
for S in itertools.product([-1, 1], repeat=n - 1):
    S1 = S + (1,)
    S1S1 = S1 * 2
    for C in itertools.combinations(range(n - 1), n // 2):
        F = list(S1)
        for i in C:
            F[i] *= -1
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if innerproduct(F, S1S1[i:i + n]):
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

C version, to get a feel for how much performance we're losing to PyPy (n = 16 under PyPy takes roughly as long as n = 18 in C):

#include <stdio.h>

enum {
  HALFN = 9,
  N = 2 * HALFN
};

int main(void) {
  long long lzc[N + 1];
  for (int i = 0; i < N + 1; i++) lzc[i] = 0;
  unsigned int xor = 1 << (N - 1);
  while (xor-- > 0) {
    if (__builtin_popcount(xor) != HALFN) continue;
    unsigned int s = 1 << (N - 1);
    while (s-- > 0) {
      lzc[0]++;
      unsigned int f = xor ^ s;
      for (int i = 1; i < N + 1; i++) {
        f = ((f & 1) << (N - 1)) | (f >> 1);
        if (__builtin_popcount(f ^ s) != HALFN) break;
        lzc[i]++;
      }
    }
  }
  for (int i = 0; i < N + 1; i++) printf(" %lld", 4 * lzc[i]);
  putchar('\n');
  return 0;
}

This algorithm is embarrassingly parallel because it's just accumulating over all values of xor. With the C version, a back-of-the-envelope calculation suggests that a few thousand hours of CPU time would suffice to calculate n = 26, which works out to a couple hundred dollars at current rates on EC2. There are undoubtedly some optimizations to be made (e.g., vectorization), but for a one-off like this I'm not sure how much more programmer effort is worthwhile.
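As a concrete illustration of that parallelism, here is a minimal Python sketch (names and chunking scheme are my own, not from the answer above) mirroring the C loop: the xor space is split into independent chunks, each chunk produces a partial count vector, and the partials are summed at the end. The chunks are processed serially here for clarity; handing them to a `multiprocessing.Pool` would distribute the same work across cores.

```python
# Hypothetical sketch: partition the accumulation over xor values.
# Mirrors the C inner loops above, with a tiny N so it runs quickly.
HALFN = 3
N = 2 * HALFN

def partial_counts(xor_lo, xor_hi):
    # Partial leading-zero counts for xor values in [xor_lo, xor_hi).
    # Each xor value contributes independently, so partials are additive.
    lzc = [0] * (N + 1)
    for xor in range(xor_lo, xor_hi):
        if bin(xor).count('1') != HALFN:
            continue
        for s in range(1 << (N - 1)):
            lzc[0] += 1
            f = xor ^ s
            for i in range(1, N + 1):
                # Rotate f right by one bit within N bits.
                f = ((f & 1) << (N - 1)) | (f >> 1)
                if bin(f ^ s).count('1') != HALFN:
                    break
                lzc[i] += 1
    return lzc

# Split the xor range into independent tasks and sum the partial results.
step = 1 << (N - 2)
chunks = [(k, k + step) for k in range(0, 1 << (N - 1), step)]
totals = [0] * (N + 1)
for lo, hi in chunks:
    part = partial_counts(lo, hi)
    totals = [a + b for a, b in zip(totals, part)]
print([4 * c for c in totals])
```

Because each chunk touches only its own accumulator, no synchronization is needed until the final reduction.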


One very simple speed up of a factor of n is to change this code:

def innerproduct(A, B):
    assert (len(A) == len(B))
    for j in xrange(len(A)):
        s = 0
        for k in xrange(0, n):
            s += A[k] * B[k]
    return s

to

def innerproduct(A, B):
    assert (len(A) == len(B))
    s = 0
    for k in xrange(0, n):
        s += A[k] * B[k]
    return s

(I don't know why you have the loop over j; it just repeats the same calculation each time, so it's unnecessary.)


I tried to speed this up and I failed badly :( But I'm posting the code anyway; it's somewhat faster, but not fast enough for values like n = 24.

My assumptions

Your lists consist of values from {-1, 1}, so I decided to use integers instead of lists - each bit represents one of the possible values: a set bit means 1, a zeroed bit means -1. The only possible results of multiplying values from {-1, 1} are 1 and -1, so I used bitwise XOR instead of multiplication. I also noticed that there's a symmetry, so you only need to check a subset (one fourth) of the possible lists and multiply the result by 4 (David explained this in his answer).

Finally, I put the results of all possible operations into lookup tables to eliminate the need to compute them. This takes a lot of memory, but who cares (for n = 24 it was about 150 MB)?
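The identity behind this encoding can be checked directly: with bit = 1 meaning +1 and bit = 0 meaning -1, two entries multiply to +1 exactly when their bits agree, so the inner product of two length-n vectors encoded as integers a and b is n - 2 * popcount(a XOR b). A table indexed by the XOR value then replaces the per-element multiply-and-sum entirely. A small verification sketch (helper names are my own):

```python
import random

n = 8

def to_vector(bits):
    # Decode an n-bit integer into a list of {-1, 1} values.
    return [1 if (bits >> k) & 1 else -1 for k in range(n)]

def inner_by_bits(a, b):
    # Inner product via the bit identity: n - 2 * popcount(a ^ b).
    return n - 2 * bin(a ^ b).count('1')

random.seed(0)
for _ in range(100):
    a = random.randrange(2 ** n)
    b = random.randrange(2 ** n)
    elementwise = sum(x * y for x, y in zip(to_vector(a), to_vector(b)))
    assert elementwise == inner_by_bits(a, b)
print("bitwise inner product matches elementwise product")
```

The `tab` list in the code below is exactly this quantity precomputed for every possible XOR value, so an inner product test becomes a single table lookup.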

And then @David Eisenstat answered the question :) So I took his code and modified it to be bit-based. It's about 2-3 times faster (for n = 16 it took ~30 seconds, compared to ~90 for David's solution), but I think it's still not enough to get results for n = 26 or so.

import itertools

n = 16
m = n + 1
mask = (2 ** n) - 1

# Create table of sum results (replaces innerproduct())
tab = []
for a in range(2 ** n):
    s = 0
    for k in range(n):
        s += -1 if a & 1 else 1
        a >>= 1
    tab.append(s)

# Create combination bit masks for combinations
comb = []
for C in itertools.combinations(range(n - 1), n // 2):
    xor = 0
    for i in C:
        xor |= (1 << i)
    comb.append(xor)

leadingzerocounts = [0] * m
for S in xrange(2 ** (n - 1)):
    S1 = S + (1 << (n - 1))
    S1S1 = S1 + (S1 << n)
    for xor in comb:
        F = S1 ^ xor
        leadingzerocounts[0] += 4
        for i in range(1, m):
            if tab[F ^ ((S1S1 >> i) & mask)]:
                break
            leadingzerocounts[i] += 4
print(leadingzerocounts)

Conclusions

I thought I had invented something brilliant and hoped that all this messing with bits would give a great speed boost, but the boost was disappointingly small :(

I think the reason is the way Python handles operators - it calls a function for every arithmetic (or logical) operation, even when it could be done with a single assembler instruction (I hoped PyPy would be able to simplify operations to that level, but it didn't). So if C (or ASM) were used with this bit-twiddling solution, it would probably perform great (maybe you could get to n = 24).