Looking up large sets of keys: dictionary vs. NumPy array Looking up large sets of keys: dictionary vs. NumPy array numpy numpy

Looking up large sets of keys: dictionary vs. NumPy array


If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.

  • The keys must be integers

  • You have enough memory to create a NumPy array whose size is as big as themaximum key value you wish to look up (so that all keys correspond to a valid index into the array.)

The idea is to use

lookup_array = np.empty((M,), dtype=values.dtype)lookup_array[keys] = valuesresult = lookup_array[key_set]

instead of

result = {lookup_dict.get(key) for key in key_set}

For example,

import numpy as npimport pandas as pddef using_dict(lookup_dict, key_set):    return {lookup_dict.get(key) for key in key_set}def using_array(lookup_array, key_set):    return lookup_array[key_set]def using_pandas(df, key_set):    return df.loc[df['a'].isin(key_set)]M = 10**6N = 2*10**5K = 10**4keys = np.random.randint(M, size=(N,))values = np.random.random((N,))lookup_dict = dict(zip(keys, values))lookup_array = np.empty((M,), dtype=values.dtype)lookup_array[keys] = valuesdf = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))key_set = np.random.choice(keys, size=(K,))

And here is a timeit benchmark (using IPython) for the methods above:

In [25]: %timeit using_array(lookup_array, key_set)10000 loops, best of 3: 22.4 µs per loopIn [26]: %timeit using_dict(lookup_dict, key_set)100 loops, best of 3: 3.73 ms per loopIn [24]: %timeit using_pandas(df, key_set)10 loops, best of 3: 38.9 ms per loop


Here's an approach with np.searchsorted -

row_idx = np.searchsorted(lookup_array[:,0],key_set)[key_set.argsort()]values = lookup_array[row_idx,1]

This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.


Loading a dictionary this huge in memory is kinda not good and then the added overhead of lookups. If this is a data structure you are using quite frequently how about using a database engine. There are KEY / VALUE databases if you don't like SQL. They are highly optimized for lookups.