
Left join using data.table


If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows (the i. prefix in i.b refers to the b column of B, i.e. the table passed as the i argument of the join):

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on = 'a'], because the latter just prints the result to the console. When you want to get the result back into A, you need to assign it, i.e. A <- B[A, on = 'a'], which gives you essentially the same result.
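A minimal sketch of the difference (using the A and B defined in the 'Used data' section below):

B[A, on = 'a']             # prints the joined result; A itself is unchanged
res <- B[A, on = 'a']      # assign to keep the result (columns a, b, i.b)
A[B, on = 'a', bb := i.b]  # update join: adds bb to A by reference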

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b] 

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning: with the update-by-reference join, the row order of A stays the same, as the sketch below illustrates.
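A minimal sketch of this (with a hypothetical out-of-order A, same join as above): the update join keeps A's row order, while merge returns the result sorted by the join column:

A <- data.table(a = c(3, 1, 4, 2), b = 12:15)  # rows deliberately out of order
B <- data.table(a = 2:3, b = 13:14)

A[B, on = 'a', bb := i.b]               # A keeps its row order: 3, 1, 4, 2
merge(A, B, by = 'a', all.x = TRUE)     # result is sorted by 'a': 1, 2, 3, 4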


To see the effect on speed and memory use, let's benchmark with some larger datasets (for the data, see the second part of the 'Used data' section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[, c(1,3,5)]
# A tibble: 2 x 3
  expression                         median mem_alloc
  <bch:expr>                       <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.
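Note also that the update join modifies AA in place, which is why iterations = 1 is used above. A sketch for repeated timings (assuming you want each iteration to start from an unmodified AA; the copy is then included in the timing) could copy first:

bm <- bench::mark(AA2 <- BB[AA, on = .(aa)],
                  copy(AA)[BB, on = .(aa), cc := cc],  # copy() so AA itself stays unmodified
                  iterations = 10,
                  check = FALSE)                       # results differ, so skip the equality check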


Used data:

library(data.table)

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e8, TRUE))
BB <- data.table(aa = sample(AA$aa, 2e5), cc = sample(2:8, 2e5, TRUE))


You can try this:

# used data; set the key of 'B' to the column on which you join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]
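Because the key of B is set to 'a', B[A] is shorthand for B[A, on = 'a']: all rows of A are kept. Assign the result if you want to keep it, e.g.:

res <- B[A]   # same as B[A, on = 'a']; result stored in res instead of printed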


For the sake of completeness, I add the table.express version of an answer to your question. table.express nicely extends the tidyverse verbs to data.table, making it a handy tool for working fast with huge datasets. Here is the solution using your datasets from the question above:

merge_test <- dataA %>% left_join(dataB, by = "A")

A left_join keeps all rows from dataA in the joined dataset.
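For comparison, a minimal sketch of the same left join in plain data.table syntax (using the A and B defined in the answers above, joined on column 'a'):

merge_test <- B[A, on = 'a']                       # all rows of A are kept
merge_test <- merge(A, B, by = 'a', all.x = TRUE)  # merge() equivalent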