Split / Explode a column of dictionaries into separate columns with pandas
To convert the strings to actual dicts, you can do `df['Pollutant Levels'].map(eval)` (or, safer for untrusted data, `ast.literal_eval`). Afterwards, the solution below can be used to convert the dicts to separate columns.

Using a small example, you can use `.apply(pd.Series)`:
```python
In [2]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [{'c': 1}, {'d': 3}, {'c': 5, 'd': 6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0
```
To combine it with the rest of the dataframe, you can `concat` the other columns with the above result:
```python
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
```
Using your code, this also works if I leave out the `iloc` part:
```python
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
```
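One caveat with the `tolist()` approach, not visible in the example above because it uses a default integer index: `pd.DataFrame(df['b'].tolist())` always builds a fresh `RangeIndex`, so with a non-default index the `concat` misaligns rows. A minimal sketch of the fix (the sample index values here are assumptions), passing `index=df.index`:

```python
import pandas as pd

# assumed sample data with a non-default index
df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [{'c': 1}, {'d': 3}, {'c': 5, 'd': 6}]},
    index=[10, 20, 30],
)

# pd.DataFrame(df['b'].tolist()) alone would get index 0, 1, 2 and misalign;
# reusing df.index keeps the expanded columns paired with the right rows
expanded = pd.DataFrame(df['b'].tolist(), index=df.index)
result = pd.concat([df.drop('b', axis=1), expanded], axis=1)
print(result)
```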
I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way of doing this now, using `json_normalize`:
```python
import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])
```
This avoids costly apply functions...
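As a minimal, self-contained sketch of the above (the sample data is an assumption, since the question's DataFrame isn't reproduced here):

```python
import pandas as pd

# assumed sample data: a column of already-parsed dicts
df = pd.DataFrame({
    'Station': ['A', 'B'],
    'Pollutant Levels': [{'a': 46, 'b': 3}, {'b': 5, 'c': 8}],
})

# expand the dicts into columns, then join back and drop the original column
df2 = pd.json_normalize(df['Pollutant Levels'])
out = df.drop(columns='Pollutant Levels').join(df2)
print(out)
```

Both `df` and `df2` share the same default integer index here, so `join` lines the rows up directly.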
- `df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))` is the fastest method to normalize a column of flat, one-level `dicts`, as per the timing analysis performed by Shijith in this answer.
  - It will not resolve other issues with columns of `list` or `dicts` that are addressed below, such as rows with `NaN`, or nested `dicts`.
- `pd.json_normalize(df.Pollutants)` is significantly faster than `df.Pollutants.apply(pd.Series)`.
  - See the `%%timeit` below. For 1M rows, `.json_normalize` is 47 times faster than `.apply`.
- Whether reading data from a file, or from an object returned by a database or API, it may not be clear if the `dict` column has `dict` or `str` type.
  - If the dictionaries in the column are `str` type, they must be converted back to `dict` type, using `ast.literal_eval` or `json.loads(…)`.
- Use `pd.json_normalize` to convert the `dicts`, with `keys` as headers and `values` for rows.
  - There are additional parameters (e.g. `record_path` & `meta`) for dealing with nested `dicts`.
- Use `pandas.DataFrame.join` to combine the original DataFrame, `df`, with the columns created using `pd.json_normalize`.
  - If the index isn't integers (as in the example), first use `df.reset_index()` to get an index of integers, before doing the normalize and join.
- Finally, use `pandas.DataFrame.drop` to remove the unneeded column of `dicts`.
- As a note, if the column has any `NaN`, they must be filled with an empty `dict`: `df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})`
  - If the `'Pollutants'` column is strings, use `'{}'`.
  - Also see How to json_normalize a column with NaNs?.
```python
import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}',
                       '{"a": "36", "b": "5", "c": "8"}',
                       '{"b": "2", "c": "7"}',
                       '{"c": "11"}',
                       '{"a": "82", "c": "15"}',
                       np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}')  # this sample column is strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column already contains dicts

# convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# normalize the column of dictionaries and join it to df
df = df.join(pd.json_normalize(df.Pollutants))

# drop Pollutants
df.drop(columns=['Pollutants'], inplace=True)

# display(df)
  Station ID    a    b    c
0       8809   46    3   12
1       8810   36    5    8
2       8811  NaN    2    7
3       8812  NaN  NaN   11
4       8813   82  NaN   15
5       8814  NaN  NaN  NaN
```
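The `record_path` and `meta` parameters mentioned above aren't exercised by the flat example. A minimal sketch of how they apply to nested data (the nested records here are assumptions, not from the question):

```python
import pandas as pd

# assumed nested records: each station carries a list of readings
data = [
    {'station': 8809, 'readings': [{'pollutant': 'a', 'value': 46},
                                   {'pollutant': 'b', 'value': 3}]},
    {'station': 8810, 'readings': [{'pollutant': 'c', 'value': 8}]},
]

# record_path flattens the inner list into rows;
# meta carries the parent-level field along with each row
flat = pd.json_normalize(data, record_path='readings', meta='station')
print(flat)
```

This produces one row per reading, with the parent `station` repeated for each of its readings.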
`%%timeit`

```python
# dataframe with 1M rows
dfb = pd.concat([df]*200000).reset_index(drop=True)

%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
5.44 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
4min 17s ± 2.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```