Replacing NAs with latest non-NA value

r data.table zoo r-faq

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)az <- zoo(1:6)bz <- zoo(c(2,NA,1,4,5,2))na.locf(bz)1 2 3 4 5 6 2 2 1 4 5 2 na.locf(bz, fromLast = TRUE)1 2 3 4 5 6 2 1 1 4 5 2 cz <- zoo(c(NA,9,3,2,3,2))na.locf(cz)2 3 4 5 6 9 3 2 3 2

r data.table zoo r-faq

Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible though.

But it plays nice with ave, which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA    ind = which(!is.na(x))      # get positions of nonmissing values    if(is.na(x[1]))             # if it begins with a missing, add the           ind = c(1,ind)        # first position to the indices    rep(x[ind], times = diff(   # repeat the values at these indices       c(ind, length(x) + 1) )) # diffing the indices + length yields how often }                               # they need to be repeatedx = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')  xx = rep(x, 1000000)  system.time({ yzoo = na.locf(xx,na.rm=F)})  ## user  system elapsed   ## 2.754   0.667   3.406   system.time({ yrep = repeat.before(xx)})  ## user  system elapsed   ## 0.597   0.199   0.793

Edit

As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal).

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {    if (!forward) x = rev(x)           # reverse x twice if carrying backward    ind = which(!is.na(x))             # get positions of nonmissing values    if (is.na(x[1]) && !na.rm)         # if it begins with NA        ind = c(1,ind)                 # add first pos    rep_times = diff(                  # diffing the indices + length yields how often        c(ind, length(x) + 1) )          # they need to be repeated    if (maxgap < Inf) {        exceed = rep_times - 1 > maxgap  # exceeding maxgap        if (any(exceed)) {               # any exceed?            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps            rep_times = diff(c(ind, length(x) + 1) ) # diff again        }    }    x = rep(x[ind], times = rep_times) # repeat the values at these indices    if (!forward) x = rev(x)           # second reversion    x}

I've also put the function in my formr package (Github only).

r data.table zoo r-faq

a data.table solution:

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]dt     y y_forward_fill 1: NA             NA 2:  2              2 3:  2              2 4: NA              2 5: NA              2 6:  3              3 7: NA              3 8:  4              4 9: NA              410: NA              4

this approach could work with forward filling zeros as well:

dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))dt[, y_forward_fill := y[1], .(cumsum(y != 0))]dt     y y_forward_fill 1:  0              0 2:  2              2 3: -2             -2 4:  0             -2 5:  0             -2 6:  3              3 7:  0              3 8: -4             -4 9:  0             -410:  0             -4

this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table. just add the group(s) to the by clause prior to the cumsum logic.

dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))dt <- dt[order(group)]dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]dt    group  y y_forward_fill 1:     a NA             NA 2:     a NA             NA 3:     a NA             NA 4:     a  2              2 5:     a NA              2 6:     a  1              1 7:     a NA              1 8:     a  3              3 9:     a NA              310:     a NA              311:     a  4              412:     a NA              413:     a  1              114:     a  4              415:     a NA              416:     a  3              317:     b  4              418:     b NA              419:     b NA              420:     b  2              2

CodeHunter

Replacing NAs with latest non-NA value

Edit

new function

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last