How do you one-hot encode a column whose values are lists of strings?
I think you need str.join with str.get_dummies:
df = df['tickers'].str.join('|').str.get_dummies()
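Here is a self-contained sketch of that one-liner. The sample frame is made up for illustration and is not the asker's data. One caveat: str.get_dummies splits on '|' by default, so this only works if no ticker contains that character.

```python
import pandas as pd

# Hypothetical sample data: each cell holds a list of ticker strings
df = pd.DataFrame({
    "tickers": [["DIS"], ["AAPL", "AMZN"], ["MCDO", "PEP", "DIS"]]
})

# Join each list into one '|'-separated string, then expand the
# string into one indicator column per distinct ticker
encoded = df["tickers"].str.join("|").str.get_dummies()
print(encoded)
```

The result is kept in a separate variable here so the original tickers column is still available afterwards.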
Or:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)
print(df)

   AAPL  ABT  ADBE  AMGN  AMZN  BABA  BAY  CVS  DIS  ECL  EMR  FAST  GE  \
1     0    0     0     0     0     0    0    0    1    0    0     0   0
2     1    0     0     0     1     1    1    0    0    0    0     0   0
3     0    0     0     0     0     0    0    0    0    0    0     0   0
4     0    1     1     1     0     0    0    1    0    0    0     0   0
5     0    1     0     0     0     0    0    1    1    1    1     1   1

   GOOGL  MCDO  PEP
1      0     0    0
2      0     0    0
3      0     1    1
4      0     0    0
5      1     0    0
You can use apply(pd.Series) and then get_dummies():
df = pd.DataFrame({"tickers": [["DIS"],
                               ["AAPL", "AMZN", "BABA", "BAY"],
                               ["MCDO", "PEP"],
                               ["ABT", "ADBE", "AMGN", "CVS"],
                               ["ABT", "CVS", "DIS", "ECL", "EMR", "FAST", "GE", "GOOGL"]]})
pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")

   AAPL  ABT  DIS  MCDO  ADBE  AMZN  CVS  PEP  AMGN  BABA  DIS  BAY  CVS  ECL  \
0     0    0    1     0     0     0    0    0     0     0    0    0    0    0
1     1    0    0     0     0     1    0    0     0     1    0    1    0    0
2     0    0    0     1     0     0    0    1     0     0    0    0    0    0
3     0    1    0     0     1     0    0    0     1     0    0    0    1    0
4     0    1    0     0     0     0    1    0     0     0    1    0    0    1

   EMR  FAST  GE  GOOGL
0    0     0   0      0
1    0     0   0      0
2    0     0   0      0
3    0     0   0      0
4    1     1   1      1
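One thing to watch in the output above: DIS and CVS each appear as two separate columns. That happens because apply(pd.Series) spreads the lists out by position and get_dummies then encodes each position independently. If you want one column per ticker, a possible follow-up is to merge the columns that share a label. The sketch below is my own addition on a smaller made-up frame, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data where "DIS" occurs at different list positions
df = pd.DataFrame({"tickers": [["DIS"], ["AAPL", "DIS"], ["ABT", "CVS", "DIS"]]})

# Same approach as above: this yields one "DIS" column per list position
dummies = pd.get_dummies(df["tickers"].apply(pd.Series), prefix="", prefix_sep="")

# Transpose so labels sit on the index, merge rows with the same label
# by taking the max, then transpose back to the original orientation
merged = dummies.astype(int).T.groupby(level=0).max().T
print(merged)
```

Taking the max means a ticker is flagged if it appears at any position in the row's list.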