pandas.io.json.json_normalize with very nested json
In the pandas example (below) what do the brackets mean? Is there a logic to be followed to go deeper with the []. [...]
result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
Each string or list of strings in the ['state', 'shortname', ['info', 'governor']]
value is a path to an element to include, in addition to the selected rows. The second argument json_normalize()
argument (record_path
, set to 'counties'
in the documentation example) tells the function how to select elements from the input data structure that make up the rows in the output, and the meta
paths adds further metadata that will be included with each of the rows. Think of these as table joins in a database, if you will.
The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties
key that references another list of dicts:
>>> data = [{'state': 'Florida',... 'shortname': 'FL',... 'info': {'governor': 'Rick Scott'},... 'counties': [{'name': 'Dade', 'population': 12345},... {'name': 'Broward', 'population': 40000},... {'name': 'Palm Beach', 'population': 60000}]},... {'state': 'Ohio',... 'shortname': 'OH',... 'info': {'governor': 'John Kasich'},... 'counties': [{'name': 'Summit', 'population': 1234},... {'name': 'Cuyahoga', 'population': 1337}]}]>>> pprint(data[0]['counties'])[{'name': 'Dade', 'population': 12345}, {'name': 'Broward', 'population': 40000}, {'name': 'Palm Beach', 'population': 60000}]>>> pprint(data[1]['counties'])[{'name': 'Summit', 'population': 1234}, {'name': 'Cuyahoga', 'population': 1337}]
Between them there are 5 rows of data to use in the output:
>>> json_normalize(data, 'counties') name population0 Dade 123451 Broward 400002 Palm Beach 600003 Summit 12344 Cuyahoga 1337
The meta
argument then names some elements that live next to those counties
lists, and those are then merged in separately. The values from the first data[0]
dictionary for those meta
elements are ('Florida', 'FL', 'Rick Scott')
, respectively, and for data[1]
the values are ('Ohio', 'OH', 'John Kasich')
, so you see those values attached to the counties
rows that came from the same top-level dictionary, repeated 3 and 2 times respectively:
>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']('Florida', 'FL', 'Rick Scott')>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']('Ohio', 'OH', 'John Kasich')>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]) name population state shortname info.governor0 Dade 12345 Florida FL Rick Scott1 Broward 40000 Florida FL Rick Scott2 Palm Beach 60000 Florida FL Rick Scott3 Summit 1234 Ohio OH John Kasich4 Cuyahoga 1337 Ohio OH John Kasich
So, if you pass in a list for the meta
argument, then each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.
In your example JSON, there are only a few nested lists to elevate with the first argument, like 'counties'
did in the example. The only example in that datastructure is the nested 'authors'
key; you'd have to extract each ['_source', 'authors']
path, after which you can add other keys from the parent object to augment those rows.
The second meta
argument then pulls in the _id
key from the outermost objects, followed by the nested ['_source', 'title']
and ['_source', 'journal']
nested paths.
The record_path
argument takes the authors
lists as the starting point, these look like:
>>> d['hits']['hits'][0]['_source']['authors'] # this value is None, and is skipped>>> d['hits']['hits'][1]['_source']['authors'][{'affiliations': ['Punjabi University'], 'author_id': '780E3459', 'author_name': 'munish puri'}, {'affiliations': ['Punjabi University'], 'author_id': '48D92C79', 'author_name': 'rajesh dhaliwal'}, {'affiliations': ['Punjabi University'], 'author_id': '7D9BD37C', 'author_name': 'r s singh'}]>>> d['hits']['hits'][2]['_source']['authors'][{'author_id': '7FF872BC', 'author_name': 'barbara eileen ryan'}]>>> # etc.
and so gives you the following rows:
>>> json_normalize(d['hits']['hits'], ['_source', 'authors']) affiliations author_id author_name0 [Punjabi University] 780E3459 munish puri1 [Punjabi University] 48D92C79 rajesh dhaliwal2 [Punjabi University] 7D9BD37C r s singh3 NaN 7FF872BC barbara eileen ryan4 NaN 0299B8E9 fraser j harbutt5 NaN 7DAB7B72 richard m freeland
and then we can use the third meta
argument to add more columns like _id
, _source.title
and _source.journal
, using ['_id', ['_source', 'journal'], ['_source', 'title']]
:
>>> json_normalize(... data['hits']['hits'],... ['_source', 'authors'],... ['_id', ['_source', 'journal'], ['_source', 'title']]... ) affiliations author_id author_name _id \0 [Punjabi University] 780E3459 munish puri 7AF8EBC3 1 [Punjabi University] 48D92C79 rajesh dhaliwal 7AF8EBC32 [Punjabi University] 7D9BD37C r s singh 7AF8EBC33 NaN 7FF872BC barbara eileen ryan 7521A7214 NaN 0299B8E9 fraser j harbutt 7DAEB9A45 NaN 7DAB7B72 richard m freeland 7B3236C5 _source.journal0 Journal of Industrial Microbiology & Biotechno...1 Journal of Industrial Microbiology & Biotechno...2 Journal of Industrial Microbiology & Biotechno...3 The American Historical Review4 The American Historical Review5 The American Historical Review _source.title \0 Development of a stable continuous flow immobi...1 Development of a stable continuous flow immobi...2 Development of a stable continuous flow immobi...3 Feminism and the women's movement : dynamics o...4 The iron curtain : Churchill, America, and the...5 The Truman Doctrine and the origins of McCarth...
You can also have a look at the library flatten_json, which does not require you to write column hierarchies as in json_normalize:
from flatten_json import flattendata = d['hits']['hits']dict_flattened = (flatten(record, '.') for record in data)df = pd.DataFrame(dict_flattened)print(df)