Remove duplicated rows using dplyr


Here is a solution using dplyr >= 0.5.

library(dplyr)

set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)

> df %>% distinct(x, y, .keep_all = TRUE)
  x y z
1 0 1 1
2 1 0 2
3 1 1 4


Note: dplyr now contains the distinct function for this purpose.
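
As an aside (not from the original answer): by default distinct() returns only the columns you ask it to deduplicate on, and .keep_all = TRUE is what retains the remaining columns, taking their values from the first occurrence of each combination. A short sketch against the df defined above:

# Only x and y are returned; z is dropped
df %>% distinct(x, y)

# All columns are kept, one row per (x, y) combination
df %>% distinct(x, y, .keep_all = TRUE)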

Original answer below:


library(dplyr)

set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)

One approach would be to group, and then only keep the first row:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
##   x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)
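
With current dplyr releases that is indeed the case; a minimal sketch, assuming the same df as above:

df %>% group_by(x, y) %>% filter(row_number() == 1)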

I've also been thinking about adding a slice() function that would work like:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)
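
A slice() verb has since been added to dplyr, taking row positions rather than from/to arguments; a sketch of the equivalent call, assuming a current dplyr version:

# slice(1) keeps the first row of each group
df %>% group_by(x, y) %>% slice(1)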

Or maybe a variation of unique() that would let you select which variables to use:

df %>% unique(x, y)


For completeness’ sake, the following also works:

df %>% group_by(x) %>% filter(!duplicated(y))

However, I prefer the solution using distinct, and I suspect it’s faster, too.
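
If you want to check that suspicion yourself, one way is a quick timing comparison; a sketch using the microbenchmark package (an assumption on my part, not something the answers above measured):

library(microbenchmark)

# Compare both approaches; on a data frame this small the differences
# are noise, so substitute a larger data set for a meaningful result.
microbenchmark(
  distinct   = df %>% distinct(x, y, .keep_all = TRUE),
  duplicated = df %>% group_by(x) %>% filter(!duplicated(y)),
  times = 100
)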