Forward fill missing values in Spark/Python


I've found a solution that works without additional coding by using a Window here. So Jeff was right, there is a solution. The full code is below; I'll briefly explain what it does, and for more details just look at the blog.

    from pyspark.sql import Window
    from pyspark.sql.functions import last
    import sys

    # define the window
    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

    # define the forward-filled column
    filled_column_temperature = last(df6['temperature'], ignorenulls=True).over(window)

    # do the fill
    spark_df_filled = df6.withColumn('temperature_filled', filled_column_temperature)

So the idea is to define a Window sliding (more on sliding windows here) through the data which always contains the current row and ALL previous ones:

    window = Window.orderBy('time')\
                   .rowsBetween(-sys.maxsize, 0)

Note that we sort by time, so the data is in the correct order. Also note that using "-sys.maxsize" ensures that the window always includes all previous data and continuously grows as it traverses the data top-down, but there might be more efficient solutions.
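On the efficiency point: PySpark also provides the Window.unboundedPreceding and Window.currentRow constants for exactly this kind of bound, so the same growing window can be expressed without sys.maxsize. A minimal sketch, assuming the same 'time' column as above:

    from pyspark.sql import Window

    # same growing window, using the built-in bounds instead of -sys.maxsize
    window = Window.orderBy('time')\
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow)

Also worth knowing: a window with orderBy but no partitionBy forces Spark to pull all rows into a single partition, so on large data you would typically add a partitionBy on some grouping key (e.g. a sensor id).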

Using the "last" function, we are always addressing the last row in that window. By passing "ignorenulls=True" we define that if the current row is null, then the function will return the most recent (last) non-null value in the window. Otherwise the actual row's value is used.

Done.