Create dummies from column with multiple values in pandas


I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:

    In [4]: df
    Out[4]:
           label
    0  (a, c, e)
    1     (a, d)
    2       (b,)
    3     (d, e)

    In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
    Out[5]:
       a  b  c  d  e
    0  1  0  1  0  1
    1  1  0  0  1  0
    2  0  1  0  0  0
    3  0  0  0  1  1
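For anyone who wants to run the one-liner above end to end, here is a minimal sketch. The frame is rebuilt by hand from the output shown, so the construction step is an assumption about how the original data was created. The `*` separator only works if it never appears inside a label.

```python
import pandas as pd

# Hypothetical reconstruction of the frame shown above: one tuple of labels per row
df = pd.DataFrame({"label": [("a", "c", "e"), ("a", "d"), ("b",), ("d", "e")]})

# Step 1: join each tuple into a single delimited string, e.g. "a*c*e"
joined = df["label"].str.join(sep="*")

# Step 2: split on the same delimiter and build one indicator column per value
dummies = joined.str.get_dummies(sep="*")

print(dummies)
```

Splitting the chain into two steps makes it easier to inspect the intermediate strings if the result looks wrong.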


I have a somewhat cleaner solution. Assume we want to transform the following dataframe

       pageid category
    0       0        a
    1       0        b
    2       1        a
    3       1        c

into

            a  b  c
    pageid
    0       1  1  0
    1       1  0  1

One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))
    # One tuple of (category, 1) pairs per pageid
    grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
    category_dicts = [dict(tuples) for tuples in grouped]
    v = DictVectorizer(sparse=False)
    X = v.fit_transform(category_dicts)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)
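As one possible "other method", here is a pandas-only sketch that uses `pd.crosstab` instead of `DictVectorizer`. This is a swapped-in technique, not something from the post above, and the variable names `counts` and `indicators` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=["a", "b", "a", "c"]))

# Count how often each (pageid, category) pair occurs
counts = pd.crosstab(df["pageid"], df["category"])

# Clip to 1 so that repeated pairs still yield a 0/1 indicator
indicators = counts.clip(upper=1)

print(indicators)
```

The `clip` step is a no-op for this sample data, but it keeps the result a true indicator matrix if a page lists the same category twice.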


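You can generate the dummies dataframe from the raw data, pick out the columns that contain a given atom, and then store the row-wise sum of those columns back in that atom's column. This relies on every atom also appearing on its own in at least one row, and on no atom name being a substring of another.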

    In [28]: df
    Out[28]:
      label
    0     A
    1     B
    2     C
    3     D
    4   A*C
    5   C*D

    In [29]: dummies = pd.get_dummies(df['label'])

    In [30]: atom_col = [c for c in dummies.columns if '*' not in c]

    In [31]: for col in atom_col:
        ...:     df[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
        ...:

    In [32]: df
    Out[32]:
      label  A  B  C  D
    0     A  1  0  0  0
    1     B  0  1  0  0
    2     C  0  0  1  0
    3     D  0  0  0  1
    4   A*C  1  0  1  0
    5   C*D  0  0  1  1
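If the labels are already `*`-delimited strings as in this example, a shorter route might be to split them directly with `Series.str.get_dummies`. This is a sketch of an alternative rather than the approach in the session above. It matches whole tokens instead of substrings, and it does not need each atom to appear on its own row.

```python
import pandas as pd

# Same sample labels as in the session above
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each label on "*" and build one indicator column per distinct token
atoms = df["label"].str.get_dummies(sep="*")

# Attach the indicator columns next to the original label column
result = df.join(atoms)

print(result)
```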