Converting a dataframe into JSON (in pyspark) and then selecting desired fields

python json apache-spark pyspark

If the result of result.toJSON().collect() is a JSON encoded string, then you would use json.loads() to convert it to a dict. The issue you're running into is that when you iterate a dict with a for loop, you're given the keys of the dict. In your for loop, you're treating the key as if it's a dict, when in fact it is just a string. Try this:

# toJSON() turns each row of the DataFrame into a JSON string# calling first() on the result will fetch the first row.results = json.loads(result.toJSON().first())for key in results:    print results[key]# To decode the entire DataFrame iterate over the result# of toJSON()def print_rows(row):    data = json.loads(row)    for key in data:        print "{key}:{value}".format(key=key, value=data[key])results = result.toJSON()results.foreach(print_rows)

EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.

collect() Return a list that contains all of the elements in this RDD.
Note This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

EDIT2: I can't emphasize enough, always read the docs.

EDIT3: Look here.

python json apache-spark pyspark

import json>>> df = sqlContext.read.table("n1")>>> df.show()+-----+-------+----+---------------+-------+----+|   c1|     c2|  c3|             c4|     c5|  c6|+-----+-------+----+---------------+-------+----+|00001|Content|   1|Content-article|       |2018||00002|Content|null|Content-article|Content|2015|+-----+-------+----+---------------+-------+----+>>> results = df.toJSON().map(lambda j: json.loads(j)).collect()>>> for i in results: print i["c1"], i["c6"]... 00001 201800002 2015

python json apache-spark pyspark

Here is what worked for me:

df_json = df.toJSON()for row in df_json.collect():    #json string    print(row)     #json object    line = json.loads(row)     print(line[some_key])

Keep in mind that using .collect() is not advisable, since it collects the distributed data frames, and defeats the purpose of using data frames.

CodeHunter

Converting a dataframe into JSON (in pyspark) and then selecting desired fields

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last