Forward fill missing values in Spark/Python


I've found a solution that works without additional coding by using a Window here. So Jeff was right, there is a solution. The full code is below; I'll briefly explain what it does, and for more details just look at the blog.

    from pyspark.sql import Window
    from pyspark.sql.functions import last
    import sys

    # define the window
    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

    # define the forward-filled column
    filled_column_temperature = last(df6['temperature'], ignorenulls=True).over(window)

    # do the fill
    spark_df_filled = df6.withColumn('temperature_filled', filled_column_temperature)

So the idea is to define a Window sliding (more on sliding windows here) through the data which always contains the current row and ALL previous ones:

    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

Note that we sort by time, so the data is in the correct order. Also note that using "-sys.maxsize" ensures that the window always includes all previous data and continuously grows as it traverses the data top-down, but there might be more efficient solutions.
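On the efficiency point: PySpark also provides the Window.unboundedPreceding and Window.currentRow constants for exactly this kind of bound, so the same growing window can be expressed without sys.maxsize. A minimal sketch, assuming the same 'time' column as above:

    from pyspark.sql import Window

    # same growing window, using the built-in bounds instead of -sys.maxsize
    window = Window.orderBy('time')\
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow)

Also worth knowing: a window with orderBy but no partitionBy forces Spark to pull all rows into a single partition, so on large data you would typically add a partitionBy on some grouping key (e.g. a sensor id).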

Using the "last" function, we are always addressing the last row in that window. By passing "ignorenulls=True" we define that if the current row is null, then the function will return the most recent (last) non-null value in the window. Otherwise the actual row's value is used.

Done.