
How to get one hot encoding of specific words in a text in Pandas?


Use sklearn.feature_extraction.text.CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(vocabulary=toxic)
    r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                           df.index,
                           cv.get_feature_names(),
                           default_fill_value=0)

Result:

    In [127]: r
    Out[127]:
       bad  horrible  disguisting
    0    0         1            0
    1    0         0            0
    2    1         0            1

    In [128]: type(r)
    Out[128]: pandas.core.sparse.frame.SparseDataFrame

    In [129]: r.info()
    <class 'pandas.core.sparse.frame.SparseDataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
    bad            3 non-null int64
    horrible       3 non-null int64
    disguisting    3 non-null int64
    dtypes: int64(3)
    memory usage: 104.0 bytes

    In [130]: r.memory_usage()
    Out[130]:
    Index          80
    bad             8   #  <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
    horrible        8
    disguisting     8
    dtype: int64

Joining the SparseDataFrame with the original DataFrame:

    In [137]: r2 = df.join(r)

    In [138]: r2
    Out[138]:
                              text  bad  horrible  disguisting
    0            You look horrible    0         1            0
    1                 You are good    0         0            0
    2  you are bad and disguisting    1         0            1

    In [139]: r2.memory_usage()
    Out[139]:
    Index          80
    text           24
    bad             8
    horrible        8
    disguisting     8
    dtype: int64

    In [140]: type(r2)
    Out[140]: pandas.core.frame.DataFrame

    In [141]: type(r2['horrible'])
    Out[141]: pandas.core.sparse.series.SparseSeries

    In [142]: type(r2['text'])
    Out[142]: pandas.core.series.Series

PS: in older pandas versions, sparse columns lost their sparsity (became dense) after joining a SparseDataFrame with a regular DataFrame. Now we can have a mixture of regular Series (columns) and SparseSeries, which is a really nice feature!


The accepted answer is outdated; see the release notes:

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.

Pandas 1.0.5 Solution:

    # Assign only to r, so the original df stays available for joining.
    r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                          df.index,
                                          cv.get_feature_names())