PySpark - how to backfill a DataFrame?
Backfilling a distributed dataset is not as easy as with a pandas (local) DataFrame - you cannot be sure that the value to fill with exists in the same partition. I would use `crossJoin` with windowing. For example, given this DataFrame:
```python
df = spark.createDataFrame([
    ('2017-01-01', None),
    ('2017-01-02', 'B'),
    ('2017-01-03', None),
    ('2017-01-04', None),
    ('2017-01-05', 'E'),
    ('2017-01-06', None),
    ('2017-01-07', 'G')],
    ['date', 'value'])
df.show()
```

```
+----------+-----+
|      date|value|
+----------+-----+
|2017-01-01| null|
|2017-01-02|    B|
|2017-01-03| null|
|2017-01-04| null|
|2017-01-05|    E|
|2017-01-06| null|
|2017-01-07|    G|
+----------+-----+
```
The code would be:
```python
from pyspark.sql.functions import col, coalesce, row_number
from pyspark.sql.window import Window

df.alias('a').crossJoin(df.alias('b')) \
    .where((col('b.date') >= col('a.date'))
           & (col('a.value').isNotNull() | col('b.value').isNotNull())) \
    .withColumn('rn', row_number().over(Window.partitionBy('a.date').orderBy('b.date'))) \
    .where(col('rn') == 1) \
    .select('a.date', coalesce('a.value', 'b.value').alias('value')) \
    .orderBy('a.date') \
    .show()
```

```
+----------+-----+
|      date|value|
+----------+-----+
|2017-01-01|    B|
|2017-01-02|    B|
|2017-01-03|    E|
|2017-01-04|    E|
|2017-01-05|    E|
|2017-01-06|    G|
|2017-01-07|    G|
+----------+-----+
```
The `last` and `first` functions, with their `ignorenulls=True` flag, can be combined with `rowsBetween` windowing. To backfill, we select the first non-null value between the current row and the end of the frame. To forward-fill, we select the last non-null value between the beginning of the frame and the current row.
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

# Backfill: take the first non-null value from the current row forward.
df.withColumn(
    'value',
    F.first(F.col('value'), ignorenulls=True)
        .over(W.orderBy('date').rowsBetween(0, sys.maxsize))
)
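For the forward-fill case described above, a minimal sketch under the same assumptions (the same `df` and its `value` column) mirrors the backfill code, with `last` over a frame from the beginning up to the current row:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

# Forward-fill: take the most recent non-null value up to the current row.
df.withColumn(
    'value',
    F.last(F.col('value'), ignorenulls=True)
        .over(W.orderBy('date').rowsBetween(-sys.maxsize, 0))
)
```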
Source on filling gaps in Spark: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9