df1 subtract df2 not working as expected in PySpark when the records are not in order
I found that my data had a unique ID column, so I used the code below to find the differing records when the row counts of the two datasets don't match. In my case, the built-in methods like subtract
and exceptAll
didn't return the rows I expected; they compare entire rows, so any column-level difference stops rows from matching. I also tried various types of joins
without success. You may have to do something similar to what I did.
df2.createOrReplaceTempView("temp2")
df1.createOrReplaceTempView("temp1")
spark.sql("select * from temp2 where `Order ID` not in (select `Order ID` from temp1)").show()
This gives me the two records that I had been expecting from df2.subtract(df1).
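The `NOT IN` subquery is essentially a set difference on the ID column. As a quick sanity check of that logic (using made-up `Order ID` values, not the actual data from the question), the same comparison in plain Python looks like:

```python
# Hypothetical Order ID values standing in for the key columns of df1 and df2.
ids_df1 = {"1001", "1002", "1003"}
ids_df2 = {"1001", "1002", "1003", "1004", "1005"}

# Order IDs present in df2 but missing from df1 -- the rows the
# NOT IN subquery would return.
missing_from_df1 = ids_df2 - ids_df1
print(sorted(missing_from_df1))  # the two "extra" records in df2
```

In the DataFrame API, a left anti join (e.g. `df2.join(df1, on="Order ID", how="left_anti")`) should express the same thing and avoids the `NOT IN` pitfall with NULL keys, though the SQL form above is what worked for me here.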
Hope this approach will help someone someday!