PySpark (Python 2.7): How to flatten values after reduce
A simple list comprehension should be more than enough:
from datetime import datetime

def flatten(kvs):
    """
    >>> kvs = ("852-YF-008", [
    ...     (datetime(2016, 5, 10, 0, 0), 0.0),
    ...     (datetime(2016, 5, 9, 23, 59), 0.0)])
    >>> flat = flatten(kvs)
    >>> len(flat)
    2
    >>> flat[0]
    ('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0)
    """
    k, vs = kvs
    return [(k, v1, v2) for v1, v2 in vs]
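For context, here is a minimal sketch of how this is typically applied, assuming a local SparkContext and the sample record from the docstring above; flatMap unpacks the list that each call to flatten returns:

from datetime import datetime
from pyspark import SparkContext

sc = SparkContext("local[*]", "flatten-example")

# Hypothetical RDD with the same shape as the docstring example:
# (key, [(datetime, value), ...])
rdd = sc.parallelize([
    ("852-YF-008", [
        (datetime(2016, 5, 10, 0, 0), 0.0),
        (datetime(2016, 5, 9, 23, 59), 0.0)]),
])

flat = rdd.flatMap(flatten)
print(flat.collect())
# [('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0),
#  ('852-YF-008', datetime.datetime(2016, 5, 9, 23, 59), 0.0)]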
In Python 2.7 you could also use a lambda expression with tuple argument unpacking, but this is not portable and is generally discouraged:
lambda (k, vs): [(k, v1, v2) for v1, v2 in vs]
Version independent:
lambda kvs: [(kvs[0], v1, v2) for v1, v2 in kvs[1]]
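For completeness, a short sketch assuming the same rdd as in the example above; the version-independent lambda produces the same flattened records without a named helper:

flat = rdd.flatMap(lambda kvs: [(kvs[0], v1, v2) for v1, v2 in kvs[1]])
flat.first()
# ('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0)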
Edit:
If all you need is to write partitioned data, convert to a DataFrame and write to Parquet directly, skipping reduceByKey altogether:
(sheet
    .flatMap(process)
    .map(lambda x: (x[0], ) + x[1])
    .toDF(["key", "datetime", "value"])
    .write
    .partitionBy("key")
    .parquet(output_path))
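As a follow-up sketch, assuming spark is an active SparkSession and output_path is the same path used in the write above: the write creates one key=... subdirectory per distinct key, and Spark prunes partitions when you filter on the partition column on read.

# Layout on disk (one directory per distinct key):
# output_path/
#   key=852-YF-008/
#     part-*.parquet
df = spark.read.parquet(output_path)
df.filter(df["key"] == "852-YF-008").show()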