Spark union of multiple RDDs


If these are RDDs, you can use the SparkContext.union method:

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])

rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()
## [1, 2, 3, 4, 5, 6, 7, 8, 9]

There is no DataFrame equivalent, but it is just a matter of a simple one-liner:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

unionAll(df1, df2, df3).show()

## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+
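On Spark 2.0 and later, DataFrame.unionAll is deprecated in favor of the equivalent DataFrame.union, so a minimal variant of the same helper, assuming Spark 2.x, would be:

from functools import reduce
from pyspark.sql import DataFrame

def union_all(*dfs):
    # union, like unionAll, resolves columns by position, not by name
    return reduce(DataFrame.union, dfs)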

If the number of DataFrames is large, using SparkContext.union on the underlying RDDs and recreating the DataFrame may be a better choice, to avoid issues related to the cost of preparing an execution plan:

def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
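As a quick check, this version is called exactly like the plan-based helper above; assuming df1, df2, and df3 from the earlier example, it yields the same six rows:

unionAll(df1, df2, df3).show()
## same six-row table as above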


You can also use the addition operator to UNION RDDs:

rdd = sc.parallelize([1, 1, 2, 3])
(rdd + rdd).collect()
## [1, 1, 2, 3, 1, 1, 2, 3]
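This works because PySpark implements RDD.__add__ as a call to union, so the two spellings are interchangeable. A minimal check, assuming a running SparkContext sc:

rdd = sc.parallelize([1, 1, 2, 3])

# `rdd + rdd` delegates to rdd.union(rdd), so both produce the same result
assert (rdd + rdd).collect() == rdd.union(rdd).collect()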


Unfortunately, it is the only way to UNION tables in Spark. However, instead of

first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
...

you can perform it in a slightly cleaner way like this:

result = rdd1.union(rdd2).union(rdd3).union(rdd4)
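If the RDDs are already collected in a list, the same chain can be expressed with functools.reduce; a small sketch, assuming rdd1 through rdd4 are defined:

from functools import reduce

rdds = [rdd1, rdd2, rdd3, rdd4]
result = reduce(lambda a, b: a.union(b), rdds)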