Convert nested JSON to CSV file in Python
Please scroll down for the newer, faster solution
This is an older question, but I struggled the entire night to get a satisfactory result for a similar situation, and I came up with this:
import jsonimport pandasdef cross_join(left, right): return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop('key', 1)def json_to_dataframe(data_in): def to_frame(data, prev_key=None): if isinstance(data, dict): df = pandas.DataFrame() for key in data: df = cross_join(df, to_frame(data[key], prev_key + '.' + key)) elif isinstance(data, list): df = pandas.DataFrame() for i in range(len(data)): df = pandas.concat([df, to_frame(data[i], prev_key)]) else: df = pandas.DataFrame({prev_key[1:]: [data]}) return df return to_frame(data_in)if __name__ == '__main__': with open('somefile') as json_file: json_data = json.load(json_file) df = json_to_dataframe(json_data) df.to_csv('data.csv', mode='w')
Explanation:
The cross_join function is a neat way I found to do a cartesian product. (credit: here)
The json_to_dataframe function does the logic, using pandas dataframes. In my case, the json was deeply nested, and I wanted to split dictionary key:value pairs into columns, but the lists I wanted to transform into rows for a column -- hence the concat -- which I then cross join with the upper level, thus multiplying the records number so that each value from the list has its own row, while the previous columns are identical.
The recursiveness creates stacks that cross join with the one below, until the last one is returned.
Then with the dataframe in a table format, it's easy to convert to CSV with the "df.to_csv()" dataframe object method.
This should work with deeply nested JSON, being able to normalize all of it into rows by the logic described above.
I hope this will help someone, someday. Just trying to give back to this awesome community.
---------------------------------------------------------------------------------------------
LATER EDIT: NEW SOLUTION
I'm coming back to this as while the dataframe option kinda worked, it took the app minutes to parse not so large JSON data. Therefore I thought of doing what the dataframes do, but by myself:
from copy import deepcopyimport pandasdef cross_join(left, right): new_rows = [] if right else left for left_row in left: for right_row in right: temp_row = deepcopy(left_row) for key, value in right_row.items(): temp_row[key] = value new_rows.append(deepcopy(temp_row)) return new_rowsdef flatten_list(data): for elem in data: if isinstance(elem, list): yield from flatten_list(elem) else: yield elemdef json_to_dataframe(data_in): def flatten_json(data, prev_heading=''): if isinstance(data, dict): rows = [{}] for key, value in data.items(): rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key)) elif isinstance(data, list): rows = [] for i in range(len(data)): [rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))] else: rows = [{prev_heading[1:]: data}] return rows return pandas.DataFrame(flatten_json(data_in))if __name__ == '__main__': json_data = { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ {"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}, {"id": "1003", "type": "Blueberry"}, {"id": "1004", "type": "Devil's Food"} ] }, "topping": [ {"id": "5001", "type": "None"}, {"id": "5002", "type": "Glazed"}, {"id": "5005", "type": "Sugar"}, {"id": "5007", "type": "Powdered Sugar"}, {"id": "5006", "type": "Chocolate with Sprinkles"}, {"id": "5003", "type": "Chocolate"}, {"id": "5004", "type": "Maple"} ], "something": [] } df = json_to_dataframe(json_data) print(df)
OUTPUT:
id type name ppu batters.batter.id batters.batter.type topping.id topping.type0 0001 donut Cake 0.55 1001 Regular 5001 None1 0001 donut Cake 0.55 1001 Regular 5002 Glazed2 0001 donut Cake 0.55 1001 Regular 5005 Sugar3 0001 donut Cake 0.55 1001 Regular 5007 Powdered Sugar4 0001 donut Cake 0.55 1001 Regular 5006 Chocolate with Sprinkles5 0001 donut Cake 0.55 1001 Regular 5003 Chocolate6 0001 donut Cake 0.55 1001 Regular 5004 Maple7 0001 donut Cake 0.55 1002 Chocolate 5001 None8 0001 donut Cake 0.55 1002 Chocolate 5002 Glazed9 0001 donut Cake 0.55 1002 Chocolate 5005 Sugar10 0001 donut Cake 0.55 1002 Chocolate 5007 Powdered Sugar11 0001 donut Cake 0.55 1002 Chocolate 5006 Chocolate with Sprinkles12 0001 donut Cake 0.55 1002 Chocolate 5003 Chocolate13 0001 donut Cake 0.55 1002 Chocolate 5004 Maple14 0001 donut Cake 0.55 1003 Blueberry 5001 None15 0001 donut Cake 0.55 1003 Blueberry 5002 Glazed16 0001 donut Cake 0.55 1003 Blueberry 5005 Sugar17 0001 donut Cake 0.55 1003 Blueberry 5007 Powdered Sugar18 0001 donut Cake 0.55 1003 Blueberry 5006 Chocolate with Sprinkles19 0001 donut Cake 0.55 1003 Blueberry 5003 Chocolate20 0001 donut Cake 0.55 1003 Blueberry 5004 Maple21 0001 donut Cake 0.55 1004 Devil's Food 5001 None22 0001 donut Cake 0.55 1004 Devil's Food 5002 Glazed23 0001 donut Cake 0.55 1004 Devil's Food 5005 Sugar24 0001 donut Cake 0.55 1004 Devil's Food 5007 Powdered Sugar25 0001 donut Cake 0.55 1004 Devil's Food 5006 Chocolate with Sprinkles26 0001 donut Cake 0.55 1004 Devil's Food 5003 Chocolate27 0001 donut Cake 0.55 1004 Devil's Food 5004 Maple
As per what the above does, well, the cross_join function does pretty much the same thing as in the dataframe solution, but without dataframes, thus being faster.
I added the flatten_list generator as I wanted to make sure that the JSON arrays are all nice and flattened, then provided as a single list of dictionaries comprising of the previous key from one iteration before assigned to each of the list's values. This pretty much mimics the pandas.concat behaviour in this case.
The logic in the main function, json_to_dataframe is then the same as before. All that needed to change was having the operations performed by dataframes as coded functions.
Also, in the dataframes solution I was not appending the previous heading to the nested object, but unless you are 100% sure you do not have conflicts in column names, then it is pretty much mandatory.
I hope this helps :).
EDIT: Modified the cross_join function to deal with the case when a nested list is empty, basically maintaining the previous result set unmodified. The output is unchanged even after adding the empty JSON list in the example JSON data. Thank you, @Nazmus Sakib for pointing it out.
For the JSON data you have given, you could do this by parsing the JSON structure to just return a list of all the leaf nodes.
This assumes that your structure is consistent throughout, if each entry can have different fields, see the second approach.
For example:
import jsonimport csvdef get_leaves(item, key=None): if isinstance(item, dict): leaves = [] for i in item.keys(): leaves.extend(get_leaves(item[i], i)) return leaves elif isinstance(item, list): leaves = [] for i in item: leaves.extend(get_leaves(i, key)) return leaves else: return [(key, item)]with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output: csv_output = csv.writer(f_output) write_header = True for entry in json.load(f_input): leaf_entries = sorted(get_leaves(entry)) if write_header: csv_output.writerow([k for k, v in leaf_entries]) write_header = False csv_output.writerow([v for k, v in leaf_entries])
If your JSON data is a list of entries in the format you have given, then you should get output as follows:
address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,yearAddress 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
If each entry can contain different (or possibly missing) fields, then a better approach would be to use a DictWriter
. In this case, all of the entries would need to be processed to determine the complete list of possible fieldnames
so that the correct header can be written.
import jsonimport csvdef get_leaves(item, key=None): if isinstance(item, dict): leaves = {} for i in item.keys(): leaves.update(get_leaves(item[i], i)) return leaves elif isinstance(item, list): leaves = {} for i in item: leaves.update(get_leaves(i, key)) return leaves else: return {key : item}with open('json.txt') as f_input: json_data = json.load(f_input)# First parse all entries to get the complete fieldname listfieldnames = set()for entry in json_data: fieldnames.update(get_leaves(entry).keys())with open('output.csv', 'w', newline='') as f_output: csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames)) csv_output.writeheader() csv_output.writerows(get_leaves(entry) for entry in json_data)
You can use the pandas library json_normalize function to flatten the struct, and then deal with it as you please. For example:
import pandas as pdimport jsonraw = """[{ "company_number": "12345678", "data": { "address": { "address_line_1": "Address 1", "locality": "Henley-On-Thames", "postal_code": "RG9 1DP", "premises": "161", "region": "Oxfordshire" }, "country_of_residence": "England", "date_of_birth": { "month": 2, "year": 1977 }, "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00", "kind": "individual-person-with-significant-control", "links": { "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl" }, "name": "John M Smith", "name_elements": { "forename": "John", "middle_name": "M", "surname": "Smith", "title": "Mrs" }, "nationality": "Vietnamese", "natures_of_control": [ "ownership-of-shares-50-to-75-percent" ], "notified_on": "2016-04-06" }}]"""data = json.loads(raw)data = pd.json_normalize(data)print(data.to_csv())
Which gives you:
,company_number,data.address.address_line_1,data.address.locality,data.address.postal_code,data.address.premises,data.address.region,data.country_of_residence,data.date_of_birth.month,data.date_of_birth.year,data.etag,data.kind,data.links.self,data.name,data.name_elements.forename,data.name_elements.middle_name,data.name_elements.surname,data.name_elements.title,data.nationality,data.natures_of_control,data.notified_on0,12345678,Address 1,Henley-On-Thames,RG9 1DP,161,Oxfordshire,England,2,1977,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,individual-person-with-significant-control,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,John M Smith,John,M,Smith,Mrs,Vietnamese,['ownership-of-shares-50-to-75-percent'],2016-04-06