Creating a |N| x |M| matrix from a hash-table



I'm not sure if there is a way to completely avoid looping but I imagine it could be optimized by using itertools:

import itertools

nested_loop_iter = itertools.product(n_vocab, m_vocab)
# note that because it iterates over n_vocab first we will need to transpose it at the end
probs = np.fromiter(map(hashes.get, nested_loop_iter), dtype=float)
probs.resize((len(n_vocab), len(m_vocab)))
probs = probs.T
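As a self-contained sketch of the same recipe (the toy vocabularies and probability values here are hypothetical, and `reshape` stands in for the in-place `resize`):

```python
import itertools
import numpy as np

# hypothetical toy data standing in for the real vocabularies and hash table
n_vocab = ["a", "b"]
m_vocab = ["x", "y", "z"]
hashes = {(n, m): float(i)
          for i, (n, m) in enumerate(itertools.product(n_vocab, m_vocab))}

# iterate over all (n, m) pairs, look each one up, and pour the results
# straight into a flat float array, then reshape and transpose
pairs = itertools.product(n_vocab, m_vocab)
probs = np.fromiter(map(hashes.get, pairs), dtype=float)
probs = probs.reshape((len(n_vocab), len(m_vocab))).T  # rows indexed by m_vocab
```

Because `itertools.product` varies the second argument fastest, the flat array comes out in n-major order, which is why the final transpose is needed.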


If your end goal is to read in your data from a .csv file, it might be easier to read the file directly using pandas.

import pandas as pd

df = pd.read_csv('coocurence_data.csv', index_col=[0, 1], header=None).unstack()
probs = df.as_matrix()

This reads your data from the csv and makes the first two columns into a multi-index, which corresponds to your two sets of words. It then unstacks the multi-index so that one set of words provides the column labels and the other the index labels. This gives you your |N| x |M| matrix, which can then be converted into a numpy array with the .as_matrix() method (deprecated in newer pandas in favor of .to_numpy()).
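The same multi-index-plus-unstack trick can be sketched without a file on disk; the dict contents below are hypothetical stand-ins for the csv rows:

```python
import pandas as pd

# hypothetical in-memory stand-in for the csv: a {(n, m): prob} dict
hashes = {("a", "x"): 0.1, ("a", "y"): 0.2,
          ("b", "x"): 0.3, ("b", "y"): 0.4}

# a dict with tuple keys becomes a Series with a two-level MultiIndex;
# unstack() then pivots the second level into the columns
df = pd.Series(hashes).unstack()
probs = df.to_numpy()  # .to_numpy() replaces the removed .as_matrix()
```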

This doesn't really resolve your question about turning your {(n,m):prob} dictionary into a numpy array, but given your intentions, it lets you avoid creating that dictionary altogether.

Also, if you're going to be reading in the csv anyway, reading it with pandas in the first place is going to be faster than using the built-in csv module: see these benchmark tests here

EDIT

In order to query a specific value in your DataFrame based on the row and column labels, use df.loc:

df.loc['xyz', 'abc']

where 'xyz' is the word in your row labels and 'abc' is your column label. Also check out df.iloc for positional lookups (df.ix has since been deprecated and removed from pandas).
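A minimal sketch of both lookup styles, with hypothetical labels and values:

```python
import pandas as pd

# hypothetical DataFrame with word labels on both axes
df = pd.DataFrame([[0.1, 0.2],
                   [0.3, 0.4]],
                  index=["xyz", "uvw"],
                  columns=["abc", "def"])

p = df.loc["xyz", "abc"]  # label-based lookup of a single cell
q = df.iloc[0, 0]         # positional lookup of the same cell
```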


[a short extension of dr-xorile's answer]

Most solutions look good to me. It depends a little on whether you need speed or convenience.

I agree that what you have is basically a matrix in sparse COO format. You might want to look at https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html

The only problem is that sparse matrices need integer indices. So as long as your hashes are small enough to be quickly expressed as an np.int64, this should work. And a sparse format like dok (dict of keys) allows $O(1)$ access to individual elements.
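A minimal sketch of the dok idea, assuming the hash keys have already been mapped to small integers (the dict contents are hypothetical):

```python
import numpy as np
from scipy.sparse import dok_matrix

# hypothetical integer-keyed version of the {(n, m): prob} hash table
hashes = {(0, 2): 0.5, (3, 1): 0.25}

# dok_matrix is itself a dict keyed on (row, col), so both filling it
# and reading single elements are O(1) per entry
mat = dok_matrix((4, 3), dtype=np.float64)
for (n, m), prob in hashes.items():
    mat[n, m] = prob
```

Entries that were never set read back as 0, which matches the usual convention for unseen co-occurrence pairs.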

(Sorry for the brevity!)

rough outline

This could potentially be fast but is kind of hacky.

  1. get the data into a sparse representation. I think coo_matrix is a natural pick to just hold your 2D hash map.

    a. load the CSV using numpy.genfromtxt and use e.g. dtype ['>u8', '>u8', np.float32] to treat the hashes as string representations of unsigned 8-byte integers. If that does not work, you might load the columns as strings and convert them with numpy afterwards. Either way you end up with three arrays, one entry per row of your hash table, which you can feed to the scipy sparse matrix constructor of your choice.

    b. if you already have the object in memory, you might be able to use the sparse constructor directly

  2. To access an element you need to parse your key strings again. Note that coo_matrix itself does not support indexing, so convert to csr or dok first:

    # np.fromstring is deprecated for binary input; np.frombuffer is the replacement
    prob = matrix[np.frombuffer(key1, dtype='>u8')[0], np.frombuffer(key2, dtype='>u8')[0]]
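The outline above can be sketched end to end. The parallel arrays below are hypothetical stand-ins for what step 1a would produce from the CSV; step 2's caveat about indexing is what motivates the CSR conversion:

```python
import numpy as np
from scipy.sparse import coo_matrix

# hypothetical parallel arrays, as the CSV load in step 1a would produce:
# two integer key columns plus one probability column
rows = np.array([0, 1, 1], dtype=np.int64)
cols = np.array([2, 0, 2], dtype=np.int64)
vals = np.array([0.5, 0.25, 0.125], dtype=np.float32)

# step 1: a coo_matrix holds the (row, col, value) triplets directly ...
mat = coo_matrix((vals, (rows, cols)))

# step 2: ... but COO does not support indexing, so convert to CSR
# (or dok) before doing point lookups
csr = mat.tocsr()
prob = csr[1, 2]
```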