How to delete a row by reference in data.table?


Good question. data.table can't delete rows by reference yet.

data.table can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert and delete. A row delete would use memmove in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row-store database (as most SQL databases are), which is better suited to fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.
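To make the column side of this concrete, here is a minimal sketch using `address()` and `truelength()` from data.table. The table `DT` and the variable names are made up for illustration; the point is only that adding and dropping a column leaves the object in place, while dropping a row today produces a new table.

```r
library(data.table)

DT <- data.table(a = 1:5)
addr <- address(DT)                 # memory address of the table object

# The vector of column pointers has spare slots beyond the columns in use.
slots_used      <- length(DT)
slots_allocated <- truelength(DT)

# Add and then delete a column by reference: no new table is created.
DT[, b := a * 2L]
DT[, b := NULL]
same_object <- identical(address(DT), addr)

# Dropping a row currently builds a new table rather than modifying DT.
DT2 <- DT[-2L]
row_delete_copied <- !identical(address(DT2), addr)
```

If this behaves as described, `same_object` and `row_delete_copied` should both be `TRUE`, and `slots_allocated` should be larger than `slots_used`.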

On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; e.g., a growing time series.
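Until something like that exists, the growing-time-series case can be approximated at user level by pre-allocating rows and filling them in place with `set()`. This is only a rough emulation of the idea, not how data.table works internally; `capacity`, `n` and `append_row` are hypothetical names chosen for this sketch.

```r
library(data.table)

capacity <- 10L                      # rows reserved up front (assumed size)
ts <- data.table(time  = rep(NA_integer_, capacity),
                 value = rep(NA_real_,    capacity))
n <- 0L                              # number of rows actually in use

# Write one row into the next free slot by reference; returns the new count.
append_row <- function(DT, n, time, value) {
  stopifnot(n < nrow(DT))            # no spare capacity left
  set(DT, i = n + 1L, j = "time",  value = time)
  set(DT, i = n + 1L, j = "value", value = value)
  n + 1L
}

n <- append_row(ts, n, 1L, 0.5)
n <- append_row(ts, n, 2L, 0.7)

used <- ts[seq_len(n)]               # the "live" part of the series
```

Deleting from the end is then just decrementing `n`. The cost is that the caller has to track `n` and handle running out of capacity.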


It's filed as an issue: Delete rows by reference.


The approach I have taken to keep memory use close to that of in-place deletion is to subset one column at a time and delete each column from the original as I go. It is not as fast as a proper C memmove solution, but memory use is all I care about here. Something like this:

    library(data.table)

    DT = data.table(col1 = 1:1e6)
    cols = paste0('col', 2:100)
    for (col in cols){ DT[, (col) := 1:1e6] }

    keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
    DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
    for (col in cols){
      DT.subset[, (col) := DT[[col]][keep.idxs]] # copy the kept rows of one column
      DT[, (col) := NULL] # delete that column from the original
    }


Here is a working function based on @vc273's answer and @Frank's feedback.

    delete <- function(DT, del.idxs) {            # pls note 'del.idxs' vs. 'keep.idxs'
      keep.idxs <- setdiff(DT[, .I], del.idxs)    # select row indexes to keep
      cols <- names(DT)
      DT.subset <- data.table(DT[[1]][keep.idxs]) # this is the subsetted table
      setnames(DT.subset, cols[1])
      for (col in cols[-1]) {                     # cols[-1] is also safe for a one-column table
        DT.subset[, (col) := DT[[col]][keep.idxs]]
        DT[, (col) := NULL]                       # delete the column from the input by reference
      }
      return(DT.subset)
    }

An example of its usage:

    dat <- delete(dat, del.idxs)   ## Pls note 'del.idxs' instead of 'keep.idxs'

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

    > dim(dat)
    [1] 1419393      25
    > system.time(dat <- delete(dat,del.idxs))
       user  system elapsed
       0.23    0.02    0.25
    > dim(dat)
    [1] 1404715      25
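For a small check that can be run without a large table, here is a sketch of the same column-at-a-time idea on toy data. `delete_rows`, `dat` and `old` are names invented for this example. It also shows a side effect worth knowing about: because columns are removed from the input by reference, any other variable pointing at the original table is left with only its first column, so the result should always be assigned back.

```r
library(data.table)

# Hypothetical helper following the column-at-a-time approach described above.
delete_rows <- function(DT, del.idxs) {
  keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)  # rows to keep
  cols <- names(DT)
  out <- data.table(DT[[1L]][keep.idxs])             # start with the first column
  setnames(out, cols[1L])
  for (col in cols[-1L]) {
    out[, (col) := DT[[col]][keep.idxs]]             # copy kept rows of this column
    DT[, (col) := NULL]                              # free it in the input
  }
  out
}

dat <- data.table(id = 1:5, x = letters[1:5], y = (1:5) / 10)
old <- dat                        # second reference to the same table, not a copy

dat <- delete_rows(dat, c(2L, 4L))

print(dat)         # rows with id 1, 3 and 5
print(names(old))  # the original object has lost every column but the first
```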

PS. Since I am new to SO, I could not add comment to @vc273's thread :-(