AWS Athena export array of structs to JSON AWS Athena export array of structs to JSON json json

AWS Athena export array of structs to JSON


I have skimmed through all the documentation and unfortunately there seems to be no way to do this as of now. The only possible workaround is

converting a struct to a json when querying athena

SELECT  my_field,  my_field.a,  my_field.b,  my_field.c.d,  my_field.c.eFROM   my_table

Or I would convert the data to json using post processing. Below script shows how

#!/usr/bin/env pythonimport ioimport repattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)pattern3 = re.compile(r'\\"', re.I)with io.open("test.csv") as f:    headers = list(map(lambda f: f.strip(), f.readline().split(",")))    for line in f.readlines():        orig_line = line        data = []        for i, l in enumerate(line.split('","')):            data.append(headers[i] + ":" + re.sub('^"|"$', "", l))        line = "{" + ','.join(data) + "}"        line = pattern1.sub(r'"\1":', line)        line = pattern2.sub(r':"\1"', line)        print(line)

The output on your input data is

{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}}

Which is a valid JSON

Converted JSON


The python code from @tarun almost got me there, but I had to modify it in several ways due to my data. In particular, I have:

  • json structures saved in Athena as strings
  • Strings that contain multiple words, and therefore need to be in between double quotes. Some of them contain "[]" and "{}" symbols.

Here is the code that worked for me, hopefully will be useful for others:

#!/usr/bin/env pythonimport ioimport re, syspattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)pattern3 = re.compile(r'\\"', re.I)with io.open(sys.argv[1]) as f:    headers = list(map(lambda f: f.strip(), f.readline().split(",")))    print(headers)    for line in f.readlines():        orig_line = line        #save the double quote cases, which mean there is a string with quotes inside        line = re.sub('""', "#", orig_line)        data = []        for i, l in enumerate(line.split('","')):            item = re.sub('^"|"$', "", l.rstrip())            if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):                data.append(headers[i] + ":" + item)            else: #we have a string                data.append(headers[i] + ": \"" + item + "\"")        line = "{" + ','.join(data) + "}"        line = pattern1.sub(r'"\1":', line)        line = pattern2.sub(r':"\1"', line)        #restate the double quotes to single ones, once inside the json        line = re.sub("#", '"', line)        print(line)


This method is not by modifying the Query.

Its by Post Processing For Javascript/Nodejs we can use the npm package athena-struct-parser.

Detailed Answer with Example

https://stackoverflow.com/a/67899845/6662952

Reference - https://www.npmjs.com/package/athena-struct-parser