
How to build a JSON file with nested records from a flat data table?


This is a solution that works and creates the desired JSON format. First, I grouped my dataframe by the appropriate columns. Then, instead of creating a dictionary (and losing data order) for each column heading/record pair, I created them as lists of tuples and transformed each list into an OrderedDict. Another OrderedDict was created for the two columns that everything else was grouped by. Precise layering between lists and OrderedDicts was necessary for the JSON conversion to produce the correct format. Also note that when dumping to JSON, sort_keys must be set to False, or all your OrderedDicts will be rearranged into alphabetical order.
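The ordering point can be illustrated in isolation. The following is a minimal sketch, not the code from this answer: the field names and values are made up for illustration, and only the standard library is used. It builds one record from a list of tuples, wraps it in an OrderedDict, and compares the output with and without sort_keys.

```python
import json
from collections import OrderedDict

# Hypothetical record: (column heading, value) pairs in the desired output order
pairs = [("firstname", "John"), ("lastname", "Doe"), ("email", "john.doe@wildlife.net")]
record = OrderedDict(pairs)

# A team wraps its list of member records in another OrderedDict
team = OrderedDict([("teamname", "1"), ("members", [record])])

ordered = json.dumps({"teams": [team]}, sort_keys=False)
alphabetical = json.dumps({"teams": [team]}, sort_keys=True)

print(ordered)       # keys appear in insertion order
print(alphabetical)  # keys are sorted alphabetically at every nesting level
```

In the standard json module sort_keys already defaults to False, so passing it explicitly mainly documents the intent.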

    import pandas
    import json
    from collections import OrderedDict

    inputExcel = 'E:\\teams.xlsx'
    exportJson = 'E:\\teams.json'

    data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

    # This creates a tuple of column headings for later use matching them with column data
    cols = []
    columnList = list(data[0:])
    for col in columnList:
        cols.append(str(col))
    columnList = tuple(cols)

    # This groups the dataframe by the 'teamname' and 'members' columns
    grouped = data.groupby(['teamname', 'members']).first()

    # This creates a reference to the index level of the groups
    groupnames = data.groupby(["teamname", "members"]).grouper.levels
    tm = (groupnames[0])

    # Create a list to add team records to at the end of the first 'for' loop
    teamsList = []

    for teamN in tm:
        teamN = int(teamN)  # added this in to prevent TypeError: 1 is not JSON serializable
        tempList = []   # Create a temporary list to add each record to
        for index, row in grouped.iterrows():
            dataRow = row
            if index[0] == teamN:  # Select the record in each row of the grouped dataframe if its index matches the team number
                # In order to have the JSON records come out in the same order, I had to first create a list of tuples, then convert to an OrderedDict
                rowDict = ([(columnList[2], dataRow[0]), (columnList[3], dataRow[1]), (columnList[4], dataRow[2]), (columnList[5], dataRow[3]), (columnList[6], dataRow[4]), (columnList[7], dataRow[5])])
                rowDict = OrderedDict(rowDict)
                tempList.append(rowDict)
        # Create another OrderedDict to keep 'teamname' and the list of members from the temporary list sorted
        t = ([('teamname', str(teamN)), ('members', tempList)])
        t = OrderedDict(t)
        # Append the OrderedDict to the empty list of teams created earlier
        ListX = t
        teamsList.append(ListX)

    # Create a final dictionary with a single item: the list of teams
    teams = {"teams": teamsList}

    # Dump to JSON format
    # sort_keys MUST be set to False, or all dictionaries will be alphabetized
    formattedJson = json.dumps(teams, indent = 1, sort_keys = False)
    # "NaN" is the NULL format in pandas dataframes - must be replaced with "NULL" to be a valid JSON file
    formattedJson = formattedJson.replace("NaN", '"NULL"')
    print(formattedJson)

    # Export to JSON file
    with open(exportJson, "w") as parsed:
        parsed.write(formattedJson)
    print("\n\nExport to JSON Complete")
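As a hedged alternative to replacing "NaN" in the dumped text (which would also alter any name or value that happens to contain those three letters), missing values can be converted before serialization. This is a sketch with made-up records, not part of the solution above, and the helper name `nan_to_none` is hypothetical. Note that it emits JSON `null` rather than the string "NULL", which may or may not suit whatever consumes the file.

```python
import json
import math
from collections import OrderedDict

def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps writes null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        # Rebuild as an OrderedDict so the key order is preserved
        return OrderedDict((k, nan_to_none(v)) for k, v in value.items())
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value

# Hypothetical record with a missing phone number
record = OrderedDict([("firstname", "John"), ("cellphone", float("nan"))])

print(json.dumps(record))               # {"firstname": "John", "cellphone": NaN}
print(json.dumps(nan_to_none(record)))  # {"firstname": "John", "cellphone": null}
```

The first line of output shows why some fix is needed at all: by default json.dumps writes a bare NaN, which is not valid JSON.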


With some input from @root I used a different tack and came up with the following code, which seems to get most of the way there:

    import pandas
    import json
    from collections import defaultdict

    inputExcel = 'E:\\teamsMM.xlsx'
    exportJson = 'E:\\teamsMM.json'

    data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

    grouped = data.groupby(['teamname', 'members']).first()

    results = defaultdict(lambda: defaultdict(dict))

    for t in grouped.itertuples():
        for i, key in enumerate(t.Index):
            if i ==0:
                nested = results[key]
            elif i == len(t.Index) -1:
                nested[key] = t
            else:
                nested = nested[key]

    formattedJson = json.dumps(results, indent = 4)
    formattedJson = '{\n"teams": [\n' + formattedJson +'\n]\n }'

    parsed = open(exportJson, "w")
    parsed.write(formattedJson)

The resulting JSON file is this:

    {
    "teams": [
    {
        "1": {
            "0": [
                [
                    1,
                    0
                ],
                "John",
                "Doe",
                "Anon",
                "916-555-1234",
                "none",
                "john.doe@wildlife.net"
            ],
            "1": [
                [
                    1,
                    1
                ],
                "Jane",
                "Doe",
                "Anon",
                "916-555-4321",
                "916-555-7890",
                "jane.doe@wildlife.net"
            ]
        },
        "2": {
            "0": [
                [
                    2,
                    0
                ],
                "Mickey",
                "Moose",
                "Moosers",
                "916-555-0000",
                "916-555-1111",
                "mickey.moose@wildlife.net"
            ],
            "1": [
                [
                    2,
                    1
                ],
                "Minny",
                "Moose",
                "Moosers",
                "916-555-2222",
                "none",
                "minny.moose@wildlife.net"
            ]
        }
    }
    ]
     }

This format is very close to the desired end product. Two issues remain: removing the redundant array (for example [1, 0]) that appears just above each first name, and getting the keys for each nesting level to be "teamname": "1" and "members": rather than "1": and "0":.

Also, I do not know why each record is being stripped of its heading on the conversion. For instance, why is the dictionary entry "firstname": "John" exported as just "John"?
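A likely explanation for the lost headings, offered as a sketch rather than a confirmed diagnosis: itertuples() yields namedtuples, and the json module serializes any tuple, named or not, as a JSON array, so the field names are dropped. The example below uses a hand-built namedtuple with made-up field names as a stand-in for the rows; `Row`, `rows`, and `teams` are illustrative names only. Converting each row to a mapping first keeps the headings, and popping the index removes the redundant [team, member] array while supplying the "teamname" value.

```python
import json
from collections import OrderedDict, namedtuple

# Stand-in for rows from itertuples(): Index holds (teamname, members)
Row = namedtuple("Row", ["Index", "firstname", "lastname"])
rows = [
    Row((1, 0), "John", "Doe"),
    Row((1, 1), "Jane", "Doe"),
    Row((2, 0), "Mickey", "Moose"),
]

# A namedtuple is dumped as a plain array, so the headings are lost
print(json.dumps(rows[0]))  # [[1, 0], "John", "Doe"]

teams = OrderedDict()
for row in rows:
    # Pair each field name with its value to keep the headings
    record = OrderedDict(zip(row._fields, row))
    # Remove the index from the record and use its first level as the team number
    team_number, _member_number = record.pop("Index")
    team = teams.setdefault(
        team_number,
        OrderedDict([("teamname", str(team_number)), ("members", [])]),
    )
    team["members"].append(record)

result = {"teams": list(teams.values())}
print(json.dumps(result, indent=4))
```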