Flattening dictionary with pd.json_normalize
Setup
Your data is structured inconveniently. I want to focus on:

- Getting the lists in 'IDs' into a list of dictionaries, which would be far more convenient.
- Getting rid of the useless keys in the parent dictionary. All we care about are the values.

Your data:
```python
{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}
```
What I want it to look like:
```python
[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111', 'BookID': '543533254353', 'SalesID': '543267765345'},
          {'StoreID': '1121111', 'BookID': '4324232342', 'SalesID': '4353543'}]}]
```
Restructure Data
Reasonable Way
A simple loop; don't mess around. This gets us the structure shown above:
```python
new = []
for v in data.values():
    temp = {**v}           # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)
```
Cute one-liner
```python
new = [
    {**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]}
    for v in data.values()
]
```
Create DataFrame with pd.json_normalize
In this call to json_normalize we need to specify the path to the records, i.e. the list of ID dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This is done with the record_path parameter, which takes a list of keys describing the path (if the records are in a deeper structure) or a single string (if the key is at the top level, which for us it is).
```python
record_path = 'IDs'
```
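To illustrate the deeper-structure case, here is a minimal sketch with invented data, where the records sit one level down under a hypothetical 'Sales' key; the path is then given as a list of keys:

```python
import pandas as pd

# Hypothetical deeper nesting: the record list lives at record -> 'Sales' -> 'IDs'
deep = [
    {'Name': 'Thrilling Tales of Dragon Slayers',
     'Sales': {'IDs': [{'StoreID': '123445452543'}]}},
]

# record_path walks the nested keys down to the list of record dicts
df = pd.json_normalize(deep, record_path=['Sales', 'IDs'], meta='Name')
```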
Then we want to tell json_normalize which keys are metadata for the records. If there is more than one record, as we have, the metadata will be repeated for each record.
```python
meta = 'Name'
```
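meta can also be a list of keys, and nested metadata is given as an inner list describing its path. A minimal sketch, assuming an invented nested 'Info' -> 'Author' field that is not in the original data:

```python
import pandas as pd

records = [
    {'Name': 'Thrilling Tales of Dragon Slayers',
     'Info': {'Author': 'A. Writer'},   # hypothetical nested metadata
     'IDs': [{'StoreID': '123445452543'}]},
]

# top-level metadata as a string, nested metadata as a list path
df = pd.json_normalize(records, record_path='IDs',
                       meta=['Name', ['Info', 'Author']])
```

The nested path becomes a dotted column name, 'Info.Author', in the result.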
So the final solution looks like this:
```python
pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers
```
However
If we are restructuring anyway, we might as well make it so we can pass the result straight to the DataFrame constructor.
```python
pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values()
    for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
```
Bonus Content
While we are at it: the data is ambiguous about whether each ID type has the same number of IDs. Suppose they did not:
```python
data = {1: {'Name': 'Thrilling Tales of Dragon Slayers',
            'IDs': {'StoreID': ['123445452543'],
                    'BookID': ['543533254353'],
                    'SalesID': ['543267765345']}},
        2: {'Name': 'boring Tales of Dragon Slayers',
            'IDs': {'StoreID': ['111111', '1121111'],
                    'BookID': ['543533254353', '4324232342'],
                    'SalesID': ['543267765345', '4353543', 'extra id']}}}
```
Then we can use zip_longest from itertools:
```python
from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values()
    for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id
```
- Use pandas.DataFrame.from_dict to read data.
- Convert the values in the 'IDs' column to separate columns: .pop removes the old column from df, pd.DataFrame(df.pop('IDs').values.tolist()) converts each dict key to a separate column, and .join attaches the new columns back to df.
- pd.Series.explode each list in the columns, with .apply.
- Depending on the data, sometimes the solution is to reshape the data, as shown by piRSquared.
```python
import pandas as pd

# test data
data = {1: {'IDs': {'BookID': ['543533254353'],
                    'SalesID': ['543267765345'],
                    'StoreID': ['123445452543']},
            'Name': 'Thrilling Tales of Dragon Slayers'},
        2: {'IDs': {'BookID': ['543533254353', '4324232342'],
                    'SalesID': ['543267765345', '4353543'],
                    'StoreID': ['111111', '1121111']},
            'Name': 'boring Tales of Dragon Slayers'}}

# load the data using from_dict
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)

# convert IDs to separate columns
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# explode the list in each column
df = df.apply(pd.Series.explode).reset_index(drop=True)

# display(df)
                                Name        BookID       SalesID       StoreID
0  Thrilling Tales of Dragon Slayers  543533254353  543267765345  123445452543
1     boring Tales of Dragon Slayers  543533254353  543267765345        111111
2     boring Tales of Dragon Slayers    4324232342       4353543       1121111
```
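As a variation, the .apply(pd.Series.explode) step can be replaced with DataFrame.explode on the list columns directly, assuming pandas >= 1.3 (which added multi-column explode) and that the lists in each row have equal lengths. A sketch under those assumptions:

```python
import pandas as pd

data = {1: {'Name': 'Thrilling Tales of Dragon Slayers',
            'IDs': {'StoreID': ['123445452543'],
                    'BookID': ['543533254353'],
                    'SalesID': ['543267765345']}},
        2: {'Name': 'boring Tales of Dragon Slayers',
            'IDs': {'StoreID': ['111111', '1121111'],
                    'BookID': ['543533254353', '4324232342'],
                    'SalesID': ['543267765345', '4353543']}}}

df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# explode all three list columns at once; requires equal-length lists per row
df = df.explode(['StoreID', 'BookID', 'SalesID']).reset_index(drop=True)
```

Unlike zip_longest above, this raises if the lists are ragged, which can be a useful sanity check.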