
Spark Dataframe distinguish columns with duplicated name


Let's start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])

There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using the parent columns:

df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+
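The same parent references also work when pulling columns from both sides at once; a minimal sketch, reusing the df1 and df2 defined above:

# Columns from both parents can be mixed freely in the select
# (sketch; reuses df1/df2 from the example data above).
df1.join(df2, df1['a'] == df2['a']).select(df1['f'], df2['a']).show(2)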

You can also use table aliases:

from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+
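The alias prefix also lets you pull the same-named column from both sides in one pass; a small sketch, assuming the df1_a and df2_a aliases from above:

# Qualified names disambiguate both 'f' columns (sketch).
df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')) \
    .select(col('df1_a.f'), col('df2_a.f')).show(2)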

Finally, you can programmatically rename the columns:

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+
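If you need this in more than one place, the rename can be factored into a small helper; with_suffix below is a hypothetical name, a minimal sketch of the same select/alias pattern:

from pyspark.sql.functions import col

def with_suffix(df, suffix):
    # Append the given suffix to every column name
    # (hypothetical helper; same pattern as above).
    return df.select(*(col(x).alias(x + suffix) for x in df.columns))

df1_r = with_suffix(df1, '_df1')
df2_r = with_suffix(df2, '_df2')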


I would recommend that you change the column names before your join (shown here in Scala):

df1.select(col("a") as "df1_a", col("f") as "df1_f")   .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a" === col("df2_a"))

The resulting DataFrame will have the schema

(df1_a, df1_f, df2_a, df2_f)
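Since the question is tagged python, a PySpark equivalent of the Scala snippet above might look like this (a sketch; .alias replaces Scala's as, and == replaces ===):

from pyspark.sql.functions import col

# Rename before joining so every output column has a unique name (sketch).
df1.select(col("a").alias("df1_a"), col("f").alias("df1_f")) \
   .join(df2.select(col("a").alias("df2_a"), col("f").alias("df2_f")),
         col("df1_a") == col("df2_a"))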


There is a simpler way than writing aliases for all of the columns you are joining on: pass the join key as a list of column names:

df1.join(df2, ['a'])

This works if the key that you are joining on is the same in both tables.
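When you join this way, Spark keeps a single copy of the join key in the output; a quick sketch with the df1 and df2 from the question:

joined = df1.join(df2, ['a'])
# The join key 'a' appears once; other duplicated names (here 'f')
# are still present twice and need one of the techniques above.
print(joined.columns)
## ['a', 'f', 'f']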

See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html