
pandas.io.json.json_normalize with very nested json


In the pandas example (below), what do the brackets mean? Is there a logic to be followed to go deeper with the []? [...]

result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include, in addition to the selected rows. The second json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select the elements from the input data structure that make up the rows in the output, and the meta paths add further metadata that will be included with each of those rows. Think of these as table joins in a database, if you will.
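As a minimal sketch of that rows-plus-joined-metadata idea (using a made-up structure, not the documentation data): each top-level dict holds a nested list of record dicts plus a sibling scalar, and meta repeats the sibling value on every row that came from the same parent:

```python
from pandas import json_normalize  # top-level since pandas 1.0; older versions: pandas.io.json.json_normalize

# Hypothetical data: each parent dict has a list of rows ('items')
# and a scalar ('label') living next to that list.
data = [
    {"label": "A", "items": [{"value": 1}, {"value": 2}]},
    {"label": "B", "items": [{"value": 3}]},
]

# record_path picks the lists that become rows; meta attaches the
# sibling 'label' value to every row produced from the same parent.
df = json_normalize(data, record_path="items", meta=["label"])
print(df)
#    value label
# 0      1     A
# 1      2     A
# 2      3     B
```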

The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dicts:

>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {'governor': 'Rick Scott'},
...          'counties': [{'name': 'Dade', 'population': 12345},
...                       {'name': 'Broward', 'population': 40000},
...                       {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {'governor': 'John Kasich'},
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> pprint(data[0]['counties'])
[{'name': 'Dade', 'population': 12345},
 {'name': 'Broward', 'population': 40000},
 {'name': 'Palm Beach', 'population': 60000}]
>>> pprint(data[1]['counties'])
[{'name': 'Summit', 'population': 1234},
 {'name': 'Cuyahoga', 'population': 1337}]

Between them there are 5 rows of data to use in the output:

>>> json_normalize(data, 'counties')
         name  population
0        Dade       12345
1     Broward       40000
2  Palm Beach       60000
3      Summit        1234
4    Cuyahoga        1337

The meta argument then names some elements that live next to those counties lists, and those are then merged in separately. The values from the first data[0] dictionary for those meta elements are ('Florida', 'FL', 'Rick Scott'), respectively, and for data[1] the values are ('Ohio', 'OH', 'John Kasich'), so you see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively:

>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
('Florida', 'FL', 'Rick Scott')
>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
('Ohio', 'OH', 'John Kasich')
>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

So, if you pass in a list for the meta argument, then each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.
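To sketch the two kinds of meta entries side by side (with hypothetical data, not the example above): a plain string names a top-level key, while a list of keys drills down through nested dictionaries, and the resulting column name joins the path with a dot:

```python
from pandas import json_normalize  # top-level since pandas 1.0

# Made-up structure: 'group' is a top-level key, while the owner name
# sits one level down inside the 'details' dict.
data = [{"group": "g1",
         "details": {"owner": "alice"},
         "rows": [{"x": 1}, {"x": 2}]}]

# 'group' is a plain meta key; ['details', 'owner'] is a nested meta
# path, which shows up as the dotted column 'details.owner'.
df = json_normalize(data, "rows", ["group", ["details", "owner"]])
print(df.columns.tolist())
# ['x', 'group', 'details.owner']
```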

In your example JSON, there are not that many nested lists to elevate with the first argument, the way 'counties' was used in the example above. The only candidate in that data structure is the nested 'authors' key; you'd have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent objects to augment those rows.

The meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.

The record_path argument takes the authors lists as the starting point; these look like:

>>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
>>> d['hits']['hits'][1]['_source']['authors']
[{'affiliations': ['Punjabi University'],
  'author_id': '780E3459',
  'author_name': 'munish puri'},
 {'affiliations': ['Punjabi University'],
  'author_id': '48D92C79',
  'author_name': 'rajesh dhaliwal'},
 {'affiliations': ['Punjabi University'],
  'author_id': '7D9BD37C',
  'author_name': 'r s singh'}]
>>> d['hits']['hits'][2]['_source']['authors']
[{'author_id': '7FF872BC',
  'author_name': 'barbara eileen ryan'}]
>>> # etc.

and so gives you the following rows:

>>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
           affiliations author_id          author_name
0  [Punjabi University]  780E3459          munish puri
1  [Punjabi University]  48D92C79      rajesh dhaliwal
2  [Punjabi University]  7D9BD37C            r s singh
3                   NaN  7FF872BC  barbara eileen ryan
4                   NaN  0299B8E9     fraser j harbutt
5                   NaN  7DAB7B72   richard m freeland

and then we can use the third (meta) argument to add more columns like _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

>>> json_normalize(
...     d['hits']['hits'],
...     ['_source', 'authors'],
...     ['_id', ['_source', 'journal'], ['_source', 'title']]
... )
           affiliations author_id          author_name       _id  \
0  [Punjabi University]  780E3459          munish puri  7AF8EBC3
1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
3                   NaN  7FF872BC  barbara eileen ryan  7521A721
4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
5                   NaN  7DAB7B72   richard m freeland  7B3236C5

                                     _source.journal  \
0  Journal of Industrial Microbiology & Biotechno...
1  Journal of Industrial Microbiology & Biotechno...
2  Journal of Industrial Microbiology & Biotechno...
3                     The American Historical Review
4                     The American Historical Review
5                     The American Historical Review

                                       _source.title
0  Development of a stable continuous flow immobi...
1  Development of a stable continuous flow immobi...
2  Development of a stable continuous flow immobi...
3  Feminism and the women's movement : dynamics o...
4  The iron curtain : Churchill, America, and the...
5  The Truman Doctrine and the origins of McCarth...


You can also have a look at the library flatten_json, which does not require you to write column hierarchies as in json_normalize:

import pandas as pd
from flatten_json import flatten

data = d['hits']['hits']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df)

See https://github.com/amirziai/flatten.
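For illustration, here is a minimal pure-Python sketch of the kind of flattening flatten_json performs (nested dicts and lists collapsed into path-style keys, with list elements getting their index in the key); flatten_sketch is a made-up name, not part of that library:

```python
def flatten_sketch(obj, sep=".", prefix=""):
    """Recursively flatten nested dicts/lists into one flat dict with
    path-style keys, roughly like flatten_json.flatten(record, '.')."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten_sketch(value, sep, f"{prefix}{key}{sep}"))
    elif isinstance(obj, list):
        # list elements are keyed by their position in the list
        for i, value in enumerate(obj):
            out.update(flatten_sketch(value, sep, f"{prefix}{i}{sep}"))
    else:
        # leaf value: strip the trailing separator from the built-up path
        out[prefix[:-len(sep)] if prefix else prefix] = obj
    return out

record = {"_id": "7AF8EBC3",
          "_source": {"title": "Some title",
                      "authors": [{"author_name": "munish puri"}]}}
print(flatten_sketch(record))
# {'_id': '7AF8EBC3',
#  '_source.title': 'Some title',
#  '_source.authors.0.author_name': 'munish puri'}
```

Because every record flattens to a single flat dict, a list of such dicts can go straight into pd.DataFrame, which is what the snippet above relies on.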