flattening nested Json in pandas data frame
If you are looking for a more general way to unfold multiple hierarchies from a json you can use recursion
and list comprehension to reshape your data. One alternative is presented below:
def flatten_json(nested_json, exclude=['']): """Flatten json object with nested keys into a single level. Args: nested_json: A nested json object. exclude: Keys to exclude from output. Returns: The flattened json object if successful, None otherwise. """ out = {} def flatten(x, name='', exclude=exclude): if type(x) is dict: for a in x: if a not in exclude: flatten(x[a], name + a + '_') elif type(x) is list: i = 0 for a in x: flatten(a, name + str(i) + '_') i += 1 else: out[name[:-1]] = x flatten(nested_json) return out
Then you can apply to your data, independent of nested levels:
New sample data
this_dict = {'events': [ {'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569, 'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8, 'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass', 'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}], 'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}, {'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569, 'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8, 'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass', 'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53},{'x': 51, 'y': 49}], 'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}
Usage
pd.DataFrame([flatten_json(x) for x in this_dict['events']])Out[1]: id playerId teamId matchId matchPeriod eventSec eventId \0 142896214 37831 3157 2214569 1H 0.893554 8 1 142896214 37831 3157 2214569 1H 0.893554 8 eventName subEventId subEventName positions_0_x positions_0_y \0 Pass 85 Simple pass 51 49 1 Pass 85 Simple pass 51 49 positions_1_x positions_1_y tags_0_id tags_0_tag_label positions_2_x \0 40 53 1801 accurate NaN 1 40 53 1801 accurate 51.0 positions_2_y 0 NaN 1 49.0
Note that this flatten_json
code is not mine, I have seen it here and here without much certainty of the original source.
- As noted in the accepted answer,
flatten_json
can be a great option, depending on the structure of the JSON, and how the structure should be flattened.- In this case the OP wants all the values for 1 event, to be on a single row, so
flatten_json
works - If the desired result is for each position in
positions
to have a separate row, thenpandas.json_normalize
is the better option.
- In this case the OP wants all the values for 1 event, to be on a single row, so
- An issue with
flatten_json
is, if there are manypositions
, then the number of columns for each event inevents
can be very large. - See How to flatten a nested JSON recursively, with flatten_json? for a more thorough explanation if using
flatten_json
.
Create 1 row for each dict
in events
data = {'events': [{'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569, 'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8, 'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass', 'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}], 'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}
Create the DataFrame
df = pd.DataFrame.from_dict(data)df = df['events'].apply(pd.Series)
Flatten positions
with pd.Series
df_p = df['positions'].apply(pd.Series)df_p_0 = df_p[0].apply(pd.Series)df_p_1 = df_p[1].apply(pd.Series)
Rename positions[0]
& positions[1]
:
df_p_0.columns = ['pos_0_x', 'pos_0_y']df_p_1.columns = ['pos_1_x', 'pos_1_y']
Flatten tags
with pd.Series
:
df_t = df.tags.apply(pd.Series)df_t = df_t[0].apply(pd.Series)df_t_t = df_t.tag.apply(pd.Series)
Rename id
& label
:
df_t = df_t.rename(columns={'id': 'tags_id'})df_t_t.columns = ['tags_tag_label']
Combine them all with pd.concat
:
df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)
Drop the old columns:
df_new = df_new.drop(['positions', 'tags'], axis=1)
Create a separate row for each position in positions
# normalize eventsdf = pd.json_normalize(data, 'events')# explode all columns with lists of dictsdf = df.apply(lambda x: x.explode()).reset_index(drop=True)# list of columns with dictscols_to_normalize = ['positions', 'tags']# if there are keys, which will become column names, overlap with excising column names# add the current column name as a prefixnormalized = list()for col in cols_to_normalize: d = pd.json_normalize(df[col], sep='_') d.columns = [f'{col}_{v}' for v in d.columns] normalized.append(d.copy())# combine df with the normalized columnsdf = pd.concat([df] + normalized, axis=1).drop(columns=cols_to_normalize)# display(df) id playerId teamId matchId matchPeriod eventSec eventId eventName subEventId subEventName positions_x positions_y tags_id tags_tag_label0 142896214 37831 3157 2214569 1H 0.893554 8 Pass 85 Simple pass 51 49 1801 accurate1 142896214 37831 3157 2214569 1H 0.893554 8 Pass 85 Simple pass 40 53 1801 accurate