PySpark (Python 2.7): How to flatten values after reduce


Simple list comprehension should be more than enough:

from datetime import datetime

def flatten(kvs):
    """
    >>> kvs = ("852-YF-008", [
    ... (datetime(2016, 5, 10, 0, 0), 0.0),
    ... (datetime(2016, 5, 9, 23, 59), 0.0)])
    >>> flat = flatten(kvs)
    >>> len(flat)
    2
    >>> flat[0]
    ('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0)
    """
    k, vs = kvs
    return [(k, v1, v2) for v1, v2 in vs]
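
In a job this is typically applied with flatMap after the reduceByKey step. A minimal sketch of that usage, assuming a SparkContext named sc and sample data shaped like the doctest above (the rdd name is made up for illustration):

from datetime import datetime

# Hypothetical input, shaped like the output of a reduceByKey step:
# (key, [(datetime, value), ...])
rdd = sc.parallelize([
    ("852-YF-008", [
        (datetime(2016, 5, 10, 0, 0), 0.0),
        (datetime(2016, 5, 9, 23, 59), 0.0)])])

# flatMap expands each (key, list) pair into individual (key, datetime, value) tuples
rdd.flatMap(flatten).collect()
# [('852-YF-008', datetime.datetime(2016, 5, 10, 0, 0), 0.0),
#  ('852-YF-008', datetime.datetime(2016, 5, 9, 23, 59), 0.0)]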

In Python 2.7 you could also use a lambda expression with tuple argument unpacking, but this is not portable (tuple parameter unpacking was removed in Python 3) and is generally discouraged:

lambda (k, vs): [(k, v1, v2) for v1, v2 in vs]

Version independent:

lambda kvs: [(kvs[0], v1, v2) for v1, v2 in kvs[1]]

Edit:

If all you need is to write partitioned data, then convert to a DataFrame and write Parquet directly, without the reduceByKey step:

(sheet
    .flatMap(process)
    .map(lambda x: (x[0], ) + x[1])
    .toDF(["key", "datetime", "value"])
    .write
    .partitionBy("key")
    .parquet(output_path))
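
partitionBy writes one subdirectory per distinct key (e.g. key=852-YF-008/), so reading a single key back is just a filter on the partition column. A rough sketch, assuming a SparkSession named spark and the same output_path:

from pyspark.sql.functions import col

# Only the matching key=... directory is scanned thanks to partition pruning
df = (spark.read
    .parquet(output_path)
    .where(col("key") == "852-YF-008"))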