Efficiently create sparse pivot tables in pandas?

Here is a method that creates a sparse scipy matrix from the data, using person and thing as the row and column indices. person_u and thing_u are lists of the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already contains the value you want in it.
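
The frame used below isn't shown in the question, so as an assumption, a long-format DataFrame along these lines (reconstructed from the output further down) would reproduce the results:

import pandas as pd

# hypothetical input data, reconstructed from the expected output below
frame = pd.DataFrame({'person': ['me', 'me', 'him', 'him', 'you', 'you'],
                      'thing':  ['a',  'd',  'b',   'd',   'a',   'c'],
                      'count':  [1,    1,    1,     1,     1,     1]})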

from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse DataFrame, you can do the following:

import numpy as np
import pandas as pd

dfs = pd.SparseDataFrame([pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
                          for i in np.arange(sparse_matrix.shape[0])],
                         index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame


The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)

dfs = pd.SparseDataFrame(sparse_matrix,
                         index=person_c.categories,
                         columns=thing_c.categories,
                         default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

The main changes were:

  • .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
  • sort() doesn't work anymore; the built-in sorted() is used instead

Other changes were more superficial:

  • using the category sizes instead of the lengths of the uniqued Series objects, just because I didn't want to create another object unnecessarily
  • the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now (see the note after this list for pandas versions where SparseDataFrame has since been removed)
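
One further note: SparseDataFrame (and SparseSeries) were removed entirely in pandas 1.0, so the last step above fails on current versions. A minimal sketch of an alternative, reusing sparse_matrix, person_c, and thing_c from the code above, is to build an ordinary DataFrame with sparse columns via the .sparse accessor:

import pandas as pd

# pandas >= 0.25: a regular DataFrame with sparse columns replaces SparseDataFrame
dfs = pd.DataFrame.sparse.from_spmatrix(sparse_matrix,
                                        index=person_c.categories,
                                        columns=thing_c.categories)

# dfs.sparse.density gives the fraction of non-fill values if you want to sanity-check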


I had a similar problem and stumbled across this post. The only difference was that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, so I used the grouper:

# function
import pandas as pd
from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper
    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper
    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j

# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
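
For reference, assuming the default sorted group ordering ((a1, b1), (a1, b2), (a2, b1), (a3, b4) for the rows and c1, c2, c3 for the columns), the dense output of the example above should come out as something like:

matrix([[0, 3, 2],
        [0, 1, 0],
        [4, 5, 0],
        [0, 0, 6]])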