How to create a lag variable within each group? How to create a lag variable within each group? r r

How to create a lag variable within each group?


You could do this within data.table

 library(data.table) data[, lag.value:=c(NA, value[-.N]), by=groups]  data #   time groups       value   lag.value #1:    1      a  0.02779005          NA #2:    2      a  0.88029938  0.02779005 #3:    3      a -1.69514201  0.88029938 #4:    1      b -1.27560288          NA #5:    2      b -0.65976434 -1.27560288 #6:    3      b -1.37804943 -0.65976434 #7:    4      b  0.12041778 -1.37804943

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)nm2 <- paste("lag", nm1, sep=".")data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1] data#    time groups      value     value1      value2  lag.value lag.value1#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406#    lag.value2#1:          NA#2:  1.12493092#3: -0.04493361#4:          NA#5:  0.94383621#6:  0.82122120#7:  0.59390132

Update

From data.table versions >= v1.9.5, we can use shift with type as lag or lead. By default, the type is lag.

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]#   time groups      value     value1      value2  lag.value lag.value1#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406#    lag.value2#1:          NA#2:  1.12493092#3: -0.04493361#4:          NA#5:  0.94383621#6:  0.82122120#7:  0.59390132

If you need the reverse, use type=lead

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]  #  time groups      value     value1      value2 lead.value lead.value1  #1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814  #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884  #3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA  #4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432  #5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406  #6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999  #7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA #   lead.value2 #1: -0.04493361 #2: -0.01619026 #3:          NA #4:  0.82122120 #5:  0.59390132 #6:  0.91897737 #7:          NA

data

 set.seed(1) data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),             value = rnorm(7), value1=rnorm(7), value2=rnorm(7))


Using package dplyr:

library(dplyr)data <-     data %>%    group_by(groups) %>%    mutate(lag.value = dplyr::lag(value, n = 1, default = NA))

gives

> dataSource: local data table [7 x 4]Groups: groups  time groups       value   lag.value1    1      a  0.07614866          NA2    2      a -0.02784712  0.076148663    3      a  1.88612245 -0.027847124    1      b  0.26526825          NA5    2      b  1.23820506  0.265268256    3      b  0.09276648  1.238205067    4      b -0.09253594  0.09276648

As noted by @BrianD, this implicitly assumes that value is sorted by group already. If not, either sort it by group, or use the order_by argument in lag. Also note that due to an existing issue with some versions of dplyr, for safety, arguments and the namespace should be explicitly given.


In base R, this will do the job:

data$lag.value <- c(NA, data$value[-nrow(data)])data$lag.value[which(!duplicated(data$groups))] <- NA

The first line adds a string of lagged (+1) observations. The second string corrects the first entry of each group, as the lagged observation is from previous group.

Note that data is of format data.frame to not use data.table.