
Flattening dictionary with pd.json_normalize


Setup

Your data is structured inconveniently. I want to focus on:

  1. Getting the lists in 'IDs' into a list of dictionaries, which would be far more convenient.
  2. Getting rid of the useless keys in the parent dictionary. All we care about are the values.

Your data:

{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}

What I want it to look like:

[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111',
           'BookID': '543533254353',
           'SalesID': '543267765345'},
          {'StoreID': '1121111',
           'BookID': '4324232342',
           'SalesID': '4353543'}]}]

Restructure Data

Reasonable Way

Simple loop, don't mess around. This gets us the structure shown above:

new = []
for v in data.values():
    temp = {**v}           # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)

Cute one-liner

new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]
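As a sanity check (a sketch using the sample data from above), the loop and the one-liner produce the same structure:

```python
data = {1: {'Name': 'Thrilling Tales of Dragon Slayers',
            'IDs': {'StoreID': ['123445452543'],
                    'BookID': ['543533254353'],
                    'SalesID': ['543267765345']}},
        2: {'Name': 'boring Tales of Dragon Slayers',
            'IDs': {'StoreID': ['111111', '1121111'],
                    'BookID': ['543533254353', '4324232342'],
                    'SalesID': ['543267765345', '4353543']}}}

# Loop version
looped = []
for v in data.values():
    temp = {**v}
    ids = temp.pop('IDs')
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    looped.append(temp)

# One-liner version
one_liner = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]}
             for v in data.values()]

assert looped == one_liner
```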

Create DataFrame with pd.json_normalize

In this call to json_normalize we need to specify the path to the records, i.e. the list of id dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This is done with the record_path parameter: we pass a list of keys that describes the path (if the records were in a deeper structure) or a single string (if the key is at the top layer, which for us, it is).

record_path = 'IDs'
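To illustrate the "deeper structure" case (a hypothetical variant, not the original data), suppose the records lived under an extra 'Meta' key; record_path would then be a list of keys:

```python
import pandas as pd

# Hypothetical nested variant: id records live under 'Meta' -> 'IDs'
nested = [
    {'Name': 'Thrilling Tales of Dragon Slayers',
     'Meta': {'IDs': [{'StoreID': '123445452543',
                       'BookID': '543533254353',
                       'SalesID': '543267765345'}]}},
]

# record_path walks the keys in order: item['Meta']['IDs']
df = pd.json_normalize(nested, record_path=['Meta', 'IDs'], meta='Name')
```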

Then we want to tell json_normalize which keys are metadata for the records. If there is more than one record, as we have, the metadata is repeated for each record.

meta = 'Name'
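A minimal sketch of that repetition, using one entry from the sample data: every record in the 'IDs' list gets its own row, and each row carries the same 'Name' value.

```python
import pandas as pd

rows = [{'Name': 'boring Tales of Dragon Slayers',
         'IDs': [{'StoreID': '111111'}, {'StoreID': '1121111'}]}]

# 'Name' is repeated once per record pulled out of 'IDs'
df = pd.json_normalize(rows, record_path='IDs', meta='Name')
```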

So the final solution looks like this:

pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers

However

If we are restructuring anyway, might as well make it so we can just pass it to the dataframe constructor.

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543

Bonus Content

While we are at it: the data is ambiguous as to whether each id type has the same number of ids. Suppose it did not.

data = {1: {'Name': "Thrilling Tales of Dragon Slayers",
            'IDs': {"StoreID": ['123445452543'],
                    "BookID": ['543533254353'],
                    "SalesID": ['543267765345']}},
        2: {'Name': "boring Tales of Dragon Slayers",
            'IDs': {"StoreID": ['111111', '1121111'],
                    "BookID": ['543533254353', '4324232342'],
                    "SalesID": ['543267765345', '4353543', 'extra id']}}}

Then we can use zip_longest from itertools, which pads the shorter lists with None:

from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id


  1. Use pandas.DataFrame.from_dict to read data
  2. Convert the values in the 'IDs' column to separate columns
    • .pop removes the old column from df
    • pd.DataFrame(df.pop('IDs').values.tolist()) converts each dict key to a separate column
    • .join the new columns back to df
  3. Use pd.Series.explode on each list column, applied via .apply.
  • Depending on the data, sometimes the solution is to reshape the data, as shown by piRSquared
import pandas as pd

# test data
data = \
{1: {'IDs': {'BookID': ['543533254353'],
             'SalesID': ['543267765345'],
             'StoreID': ['123445452543']},
     'Name': 'Thrilling Tales of Dragon Slayers'},
 2: {'IDs': {'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543'],
             'StoreID': ['111111', '1121111']},
     'Name': 'boring Tales of Dragon Slayers'}}

# load the data using from_dict
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)

# convert IDs to separate columns
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# explode the list in each column
df = df.apply(pd.Series.explode).reset_index(drop=True)

# display(df)
                                Name        BookID       SalesID       StoreID
0  Thrilling Tales of Dragon Slayers  543533254353  543267765345  123445452543
1     boring Tales of Dragon Slayers  543533254353  543267765345        111111
2     boring Tales of Dragon Slayers    4324232342       4353543       1121111
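As a side note (assuming pandas 1.3 or newer), DataFrame.explode accepts a list of columns directly, which avoids the .apply call when the lists in each row have equal lengths. A minimal sketch on a one-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['boring Tales of Dragon Slayers'],
    'BookID': [['543533254353', '4324232342']],
    'SalesID': [['543267765345', '4353543']],
})

# pandas >= 1.3: explode several list columns in one call,
# pairing elements positionally within each row
out = df.explode(['BookID', 'SalesID'], ignore_index=True)
```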