Create dummies from column with multiple values in pandas
It has been a while since this question was asked, but pandas now has a documented one-liner for this:
    In [4]: df
    Out[4]:
           label
    0  (a, c, e)
    1     (a, d)
    2       (b,)
    3     (d, e)

    In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
    Out[5]:
       a  b  c  d  e
    0  1  0  1  0  1
    1  1  0  0  1  0
    2  0  1  0  0  0
    3  0  0  0  1  1
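For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same one-liner. The sample frame is rebuilt from the output shown above, and it assumes the separator `*` never occurs inside a label (pick another separator if it does).

```python
import pandas as pd

# Sample data rebuilt from the session above: each cell holds a tuple of labels.
df = pd.DataFrame({"label": [("a", "c", "e"), ("a", "d"), ("b",), ("d", "e")]})

# Join each tuple into one delimited string, then split that string
# back into one indicator column per distinct label.
# Assumption: "*" does not appear inside any label.
dummies = df["label"].str.join(sep="*").str.get_dummies(sep="*")
print(dummies)
```

The round trip through a string is what lets `str.get_dummies` handle several values per row, since it works on delimited strings rather than on tuples directly.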
I have a somewhat cleaner solution. Assume we want to transform the following dataframe
       pageid category
    0       0        a
    1       0        b
    2       1        a
    3       1        c
into
            a  b  c
    pageid
    0       1  1  0
    1       1  0  1
One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.
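    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))

    # For each pageid, collect its categories as (category, 1) pairs
    grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
    category_dicts = [dict(tuples) for tuples in grouped]

    v = DictVectorizer(sparse=False)
    X = v.fit_transform(category_dicts)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)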
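As one of the "other methods" asked about, here is a pandas-only sketch that avoids scikit-learn. It is not from the original post: it uses `pd.crosstab` to count (pageid, category) pairs and then caps the counts at 1, which assumes you want a 0/1 indicator even when a category repeats within a page.

```python
import pandas as pd

df = pd.DataFrame({"pageid": [0, 0, 1, 1], "category": ["a", "b", "a", "c"]})

# Count how often each category occurs per pageid, then cap the counts at 1
# so that repeated categories still produce a plain 0/1 indicator.
indicators = pd.crosstab(df["pageid"], df["category"]).clip(upper=1)
print(indicators)
```

The result is indexed by `pageid` with one integer column per category, the same shape as the target table in the question.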
You can generate the dummies dataframe from your raw data, select the dummy columns that contain a given atom, and then store the row-wise sum of those columns back in that atom's column.
    In [28]: df
    Out[28]:
      label
    0     A
    1     B
    2     C
    3     D
    4   A*C
    5   C*D

    In [29]: dummies = pd.get_dummies(df['label'])

    In [30]: atom_col = [c for c in dummies.columns if '*' not in c]

    In [31]: for col in atom_col:
        ...:     df[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
        ...:

    In [32]: df
    Out[32]:
      label  A  B  C  D
    0     A  1  0  0  0
    1     B  0  1  0  0
    2     C  0  0  1  0
    3     D  0  0  0  1
    4   A*C  1  0  1  0
    5   C*D  0  0  1  1
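One caveat with the loop above: `col in c` is a substring test, so an atom named `A` would also match a column such as `AB*C`. It also only picks up atoms that occur on their own in some row. A possible alternative, sketched here under the assumption that `*` is always the delimiter, is to split on it directly with `str.get_dummies` and join the result back onto the frame.

```python
import pandas as pd

# Sample data rebuilt from the session above.
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each label on "*" and build one indicator column per distinct atom.
# Matching is done on whole tokens, not substrings.
atoms = df["label"].str.get_dummies(sep="*")

# Attach the indicator columns next to the original label column.
result = df.join(atoms)
print(result)
```

For the sample data this gives the same `A`, `B`, `C`, `D` columns as the loop, without modifying `df` in place.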