R: using data.table := operations to calculate new columns R: using data.table := operations to calculate new columns r r

R: using data.table := operations to calculate new columns


Great question. Try this :

dt     TICKER     PERIOD       DATE ID VALUE[1,]    ABC 2010-12-31 2010-01-05  1   1.5[2,]    ABC 2010-12-31 2010-01-08  1   1.4[3,]    ABC 2010-12-31 2010-01-10  1   1.4[4,]    ABC 2010-12-31 2010-01-13  1   1.5[5,]    ABC 2010-12-31 2010-01-07  2   1.3[6,]    ABC 2010-12-31 2010-01-11  2   1.2[7,]    ABC 2010-12-31 2010-01-09  3   1.6[8,]    DEF 2011-12-31 2011-02-05  1   2.3ids = unique(dt$ID)dt[,PRIORAVG:=NA_real_]for (i in 1:nrow(dt))    dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),                      mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]dt     TICKER     PERIOD       DATE ID VALUE PRIORAVG[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA

Then what you had already with a slight simplification ...

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

If this is ok as a prototype then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead :

for (i in 1:nrow(dt))    set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),                   mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])dt     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

That should be a lot faster than the repeated vector scans shown in the question.

Or, the operation could be vectorized. But that would be less easy to write and read due to the features of this task.

Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show expected output again then I'll add the calculation of age using join inherited scope I mentioned in comments.


Another possible approach using later versions of data.table:

library(data.table) #data.table_1.12.6 as of Nov 20, 2019cols <- copy(names(DT))DT[, c("MIN_DATE", "MAX_DATE") := .(DATE - 180L, DATE)]DT[, PRIORAVG :=         .SD[.SD, on=.(TICKER, PERIOD, DATE>=MIN_DATE, DATE<=MAX_DATE),            by=.EACHI, {                subdat <- .SD[x.ID!=i.ID]                pavg <- if (subdat[, .N > 0L])                    mean(subdat[, last(VALUE), ID]$V1, na.rm=TRUE)                else                     NA_real_                c(setNames(mget(paste0("i.", cols)), cols), .(PRIORAVG=pavg))            }]$PRIORAVG]DT[, PREV := shift(VALUE), .(TICKER, PERIOD, ID)]

output:

    TICKER     PERIOD       DATE ID VALUE   MIN_DATE   MAX_DATE PRIORAVG PREV 1:    ABC 2010-12-31 2010-01-05  1   1.5 2009-07-09 2010-01-05       NA   NA 2:    ABC 2010-12-31 2010-01-08  1   1.4 2009-07-12 2010-01-08     1.30  1.5 3:    ABC 2010-12-31 2010-01-10  1   1.4 2009-07-14 2010-01-10     1.45  1.4 4:    ABC 2010-12-31 2010-01-13  1   1.5 2009-07-17 2010-01-13     1.40  1.4 5:    ABC 2010-12-31 2010-04-01  1   1.7 2009-10-03 2010-04-01     1.40  1.5 6:    ABC 2010-12-31 2010-01-07  2   1.3 2009-07-11 2010-01-07     1.50   NA 7:    ABC 2010-12-31 2010-01-11  2   1.2 2009-07-15 2010-01-11     1.50  1.3 8:    ABC 2010-12-31 2010-04-02  2   1.8 2009-10-04 2010-04-02     1.65  1.2 9:    ABC 2010-12-31 2010-08-03  2   1.7 2010-02-04 2010-08-03     1.70  1.810:    ABC 2010-12-31 2010-01-09  3   1.6 2009-07-13 2010-01-09     1.35   NA11:    DEF 2011-12-31 2011-02-05  1   2.3 2010-08-09 2011-02-05       NA   NA