Split / Explode a column of dictionaries into separate columns with pandas Split / Explode a column of dictionaries into separate columns with pandas json json

Split / Explode a column of dictionaries into separate columns with pandas

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.

Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})In [3]: dfOut[3]:   a                   b0  1           {u'c': 1}1  2           {u'd': 3}2  3  {u'c': 5, u'd': 6}In [4]: df['b'].apply(pd.Series)Out[4]:     c    d0  1.0  NaN1  NaN  3.02  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)Out[7]:   a    c    d0  1  1.0  NaN1  2  NaN  3.02  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)Out[15]:   a    c    d0  1  1.0  NaN1  2  NaN  3.02  3  5.0  6.0

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

import pandas as pddf2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...

  • The fastest method to normalize a column of flat, one-level dicts, as per the timing analysis performed by Shijith in this answer:
    • df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
    • It will not resolve other issues with columns of list or dicts that are addressed below, such as rows with NaN, or nested dicts.
  1. pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
    • See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.
  2. Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
    • If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
  3. Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
    • There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
  4. Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
    • If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
  5. Finally, use pandas.DataFrame.drop, to remove the unneeded column of dicts
  • As a note, if the column has any NaN, they must be filled with an empty dict
import pandas as pdfrom ast import literal_evalimport numpy as npdata = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}df = pd.DataFrame(data)# display(df)   Station ID                        Pollutants0        8809  {"a": "46", "b": "3", "c": "12"}1        8810   {"a": "36", "b": "5", "c": "8"}2        8811              {"b": "2", "c": "7"}3        8812                       {"c": "11"}4        8813            {"a": "82", "c": "15"}5        8814                               NaN# replace NaN with '{}' if the column is strings, otherwise replace with {}# df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of stringsdf.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings# Convert the column of stringified dicts to dicts# skip this line, if the column contains dictsdf.Pollutants = df.Pollutants.apply(literal_eval)# reset the index if the index is not unique integers from 0 to n-1# df.reset_index(inplace=True)  # uncomment if needed# normalize the column of dictionaries and join it to dfdf = df.join(pd.json_normalize(df.Pollutants))# drop Pollutantsdf.drop(columns=['Pollutants'], inplace=True)# display(df)   Station ID    a    b    c0        8809   46    3   121        8810   36    5    82        8811  NaN    2    73        8812  NaN  NaN   114        8813   82  NaN   155        8814  NaN  NaN  NaN


# dataframe with 1M rowsdfb = pd.concat([df]*200000).reset_index(drop=True)%%timeitdfb.join(pd.json_normalize(dfb.Pollutants))[out]:5.44 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)%%timeitpd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)[out]:4min 17s ± 2.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)