PySpark (Python 2.7): How to flatten values after reduce


Simple list comprehension should be more than enough:

from datetime import datetime

def flatten(kvs):
    """
    >>> kvs = ("852-YF-008", [
    ... (datetime(2016, 5, 10, 0, 0), 0.0),
    ... (datetime(2016, 5, 9, 23, 59), 0.0)])
    >>> flat = flatten(kvs)
    >>> len(flat)
    2
    >>> flat[0]
    ('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0)
    """
    k, vs = kvs
    return [(k, v1, v2) for v1, v2 in vs]
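
In a job this is typically applied with flatMap after the reduceByKey step. A minimal sketch of that usage, assuming a SparkContext named sc and sample data shaped like the doctest above (the rdd name is made up for illustration):

from datetime import datetime

# Hypothetical input, shaped like the output of a reduceByKey step:
# (key, [(datetime, value), ...])
rdd = sc.parallelize([
    ("852-YF-008", [
        (datetime(2016, 5, 10, 0, 0), 0.0),
        (datetime(2016, 5, 9, 23, 59), 0.0)])])

# flatMap expands each (key, list) pair into individual (key, datetime, value) tuples
rdd.flatMap(flatten).collect()
# [('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0),
#  ('852-YF-008', datetime.datetime(2016, 5, 9, 23, 59), 0.0)]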

In Python 2.7 you could also use a lambda expression with tuple argument unpacking, but this is not portable (tuple parameter unpacking was removed in Python 3) and is generally discouraged:

lambda (k, vs): [(k, v1, v2) for v1, v2 in vs]

Version independent:

lambda kvs: [(kvs[0], v1, v2) for v1, v2 in kvs[1]]

Edit:

If all you need is to write partitioned data, then convert to a DataFrame and write Parquet directly, without the reduceByKey step:

(sheet
    .flatMap(process)
    .map(lambda x: (x[0], ) + x[1])
    .toDF(["key", "datetime", "value"])
    .write
    .partitionBy("key")
    .parquet(output_path))
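
partitionBy writes one subdirectory per distinct key (e.g. key=852-YF-008/), so reading a single key back is just a filter on the partition column. A rough sketch, assuming a SparkSession named spark and the same output_path:

from pyspark.sql.functions import col

# Only the matching key=... directory is scanned thanks to partition pruning
df = (spark.read
    .parquet(output_path)
    .where(col("key") == "852-YF-008"))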