How to get a one-hot encoding of specific words in a text in pandas?
Use sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic)
r = pd.SparseDataFrame(cv.fit_transform(df['text']), df.index, cv.get_feature_names(), default_fill_value=0)
Result:
In [127]: r
Out[127]:
   bad  horrible  disguisting
0    0         1            0
1    0         0            0
2    1         0            1

In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame

In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad            3 non-null int64
horrible       3 non-null int64
disguisting    3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes

In [130]: r.memory_usage()
Out[130]:
Index          80
bad             8    # <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible        8
disguisting     8
dtype: int64
Joining the SparseDataFrame with the original DataFrame:
In [137]: r2 = df.join(r)

In [138]: r2
Out[138]:
                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1

In [139]: r2.memory_usage()
Out[139]:
Index          80
text           24
bad             8
horrible        8
disguisting     8
dtype: int64

In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame

In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries

In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series
PS: In older pandas versions, sparse columns lost their sparsity (they were densified) after joining a SparseDataFrame with a regular DataFrame. Now we can have a mixture of regular Series (columns) and SparseSeries, which is a really nice feature!
The accepted answer uses an API that has since been removed; see the release notes:
SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.
Pandas 1.0.5 Solution:
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']), df.index, cv.get_feature_names())