flattening nested Json in pandas data frame flattening nested Json in pandas data frame pandas pandas

flattening nested Json in pandas data frame


If you are looking for a more general way to unfold multiple hierarchies from a json you can use recursion and list comprehension to reshape your data. One alternative is presented below:

def flatten_json(nested_json, exclude=['']):    """Flatten json object with nested keys into a single level.        Args:            nested_json: A nested json object.            exclude: Keys to exclude from output.        Returns:            The flattened json object if successful, None otherwise.    """    out = {}    def flatten(x, name='', exclude=exclude):        if type(x) is dict:            for a in x:                if a not in exclude: flatten(x[a], name + a + '_')        elif type(x) is list:            i = 0            for a in x:                flatten(a, name + str(i) + '_')                i += 1        else:            out[name[:-1]] = x    flatten(nested_json)    return out

Then you can apply to your data, independent of nested levels:

New sample data

this_dict = {'events': [  {'id': 142896214,   'playerId': 37831,   'teamId': 3157,   'matchId': 2214569,   'matchPeriod': '1H',   'eventSec': 0.8935539999999946,   'eventId': 8,   'eventName': 'Pass',   'subEventId': 85,   'subEventName': 'Simple pass',   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}, {'id': 142896214,   'playerId': 37831,   'teamId': 3157,   'matchId': 2214569,   'matchPeriod': '1H',   'eventSec': 0.8935539999999946,   'eventId': 8,   'eventName': 'Pass',   'subEventId': 85,   'subEventName': 'Simple pass',   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53},{'x': 51, 'y': 49}],   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Usage

pd.DataFrame([flatten_json(x) for x in this_dict['events']])Out[1]:          id  playerId  teamId  matchId matchPeriod  eventSec  eventId  \0  142896214     37831    3157  2214569          1H  0.893554        8   1  142896214     37831    3157  2214569          1H  0.893554        8     eventName  subEventId subEventName  positions_0_x  positions_0_y  \0      Pass          85  Simple pass             51             49   1      Pass          85  Simple pass             51             49      positions_1_x  positions_1_y  tags_0_id tags_0_tag_label  positions_2_x  \0             40             53       1801         accurate            NaN   1             40             53       1801         accurate           51.0      positions_2_y  0            NaN  1           49.0  

Note that this flatten_json code is not mine, I have seen it here and here without much certainty of the original source.


  • As noted in the accepted answer, flatten_json can be a great option, depending on the structure of the JSON, and how the structure should be flattened.
    • In this case the OP wants all the values for 1 event, to be on a single row, so flatten_json works
    • If the desired result is for each position in positions to have a separate row, then pandas.json_normalize is the better option.
  • An issue with flatten_json is, if there are many positions, then the number of columns for each event in events can be very large.
  • See How to flatten a nested JSON recursively, with flatten_json? for a more thorough explanation if using flatten_json.

Create 1 row for each dict in events

data = {'events': [{'id': 142896214,                    'playerId': 37831,                    'teamId': 3157,                    'matchId': 2214569,                    'matchPeriod': '1H',                    'eventSec': 0.8935539999999946,                    'eventId': 8,                    'eventName': 'Pass',                    'subEventId': 85,                    'subEventName': 'Simple pass',                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Create the DataFrame

df = pd.DataFrame.from_dict(data)df = df['events'].apply(pd.Series)

enter image description here

Flatten positions with pd.Series

df_p = df['positions'].apply(pd.Series)df_p_0 = df_p[0].apply(pd.Series)df_p_1 = df_p[1].apply(pd.Series)

Rename positions[0] & positions[1]:

df_p_0.columns = ['pos_0_x', 'pos_0_y']df_p_1.columns = ['pos_1_x', 'pos_1_y']

Flatten tags with pd.Series:

df_t = df.tags.apply(pd.Series)df_t = df_t[0].apply(pd.Series)df_t_t = df_t.tag.apply(pd.Series)

Rename id & label:

df_t =  df_t.rename(columns={'id': 'tags_id'})df_t_t.columns = ['tags_tag_label']

Combine them all with pd.concat:

df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)

Drop the old columns:

df_new = df_new.drop(['positions', 'tags'], axis=1)

enter image description here

Create a separate row for each position in positions

# normalize eventsdf = pd.json_normalize(data, 'events')# explode all columns with lists of dictsdf = df.apply(lambda x: x.explode()).reset_index(drop=True)# list of columns with dictscols_to_normalize = ['positions', 'tags']# if there are keys, which will become column names, overlap with excising column names# add the current column name as a prefixnormalized = list()for col in cols_to_normalize:        d = pd.json_normalize(df[col], sep='_')    d.columns = [f'{col}_{v}' for v in d.columns]    normalized.append(d.copy())# combine df with the normalized columnsdf = pd.concat([df] + normalized, axis=1).drop(columns=cols_to_normalize)# display(df)          id  playerId  teamId  matchId matchPeriod  eventSec  eventId eventName  subEventId subEventName  positions_x  positions_y  tags_id tags_tag_label0  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           51           49     1801       accurate1  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           40           53     1801       accurate