Looking up large sets of keys: dictionary vs. NumPy array
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
The keys must be integers
You have enough memory to create a NumPy array whose size is as big as themaximum key value you wish to look up (so that all keys correspond to a valid index into the array.)
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)lookup_array[keys] = valuesresult = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as npimport pandas as pddef using_dict(lookup_dict, key_set): return {lookup_dict.get(key) for key in key_set}def using_array(lookup_array, key_set): return lookup_array[key_set]def using_pandas(df, key_set): return df.loc[df['a'].isin(key_set)]M = 10**6N = 2*10**5K = 10**4keys = np.random.randint(M, size=(N,))values = np.random.random((N,))lookup_dict = dict(zip(keys, values))lookup_array = np.empty((M,), dtype=values.dtype)lookup_array[keys] = valuesdf = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))key_set = np.random.choice(keys, size=(K,))
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)10000 loops, best of 3: 22.4 µs per loopIn [26]: %timeit using_dict(lookup_dict, key_set)100 loops, best of 3: 3.73 ms per loopIn [24]: %timeit using_pandas(df, key_set)10 loops, best of 3: 38.9 ms per loop
Here's an approach with np.searchsorted
-
row_idx = np.searchsorted(lookup_array[:,0],key_set)[key_set.argsort()]values = lookup_array[row_idx,1]
This assumes that lookup_array
has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted
.
Loading a dictionary this huge in memory is kinda not good and then the added overhead of lookups. If this is a data structure you are using quite frequently how about using a database engine. There are KEY / VALUE databases if you don't like SQL. They are highly optimized for lookups.