Filter rows which contain a certain string Filter rows which contain a certain string r r

Filter rows which contain a certain string


The answer to the question was already posted by the @latemail in the comments above. You can use regular expressions for the second and subsequent arguments of filter like this:

dplyr::filter(df, !grepl("RTB",TrackingPixel))

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

mtcars$type <- rownames(mtcars)dplyr::filter(mtcars, grepl('Toyota|Mazda', type))   mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX42 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))


Solution

It is possible to use str_detect of the stringr package included in the tidyverse package. str_detect returns True or False as to whether the specified vector contains some specific string. It is possible to filter using this boolean value. See Introduction to stringr for details about stringr package.

library(tidyverse)# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─# ✔ ggplot2 2.2.1     ✔ purrr   0.2.4# ✔ tibble  1.4.2     ✔ dplyr   0.7.4# ✔ tidyr   0.7.2     ✔ stringr 1.2.0# ✔ readr   1.1.1     ✔ forcats 0.3.0# ─ Conflicts ───────────────────── tidyverse_conflicts() ─# ✖ dplyr::filter() masks stats::filter()# ✖ dplyr::lag()    masks stats::lag()mtcars$type <- rownames(mtcars)mtcars %>%  filter(str_detect(type, 'Toyota|Mazda'))# mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag# 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla# 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

The good things about Stringr

We should use rather stringr::str_detect() than base::grepl(). This is because there are the following reasons.

  • The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
  • The first argument of the functions of stringr package is always the data.frame (or value), then comes the parameters.(Thank you Paolo)
object <- "stringr"# The functions with the same prefix `str_`.# The first argument is an object.stringr::str_count(object) # -> 7stringr::str_sub(object, 1, 3) # -> "str"stringr::str_detect(object, "str") # -> TRUEstringr::str_replace(object, "str", "") # -> "ingr"# The function names without common points.# The position of the argument of the object also does not match.base::nchar(object) # -> 7base::substr(object, 1, 3) # -> "str"base::grepl("str", object) # -> TRUEbase::sub("str", "", object) # -> "ingr"

Benchmark

The results of the benchmark test are as follows. For large dataframe, str_detect is faster.

library(rbenchmark)library(tidyverse)# The data. Data expo 09. ASA Statistics Computing and Graphics # http://stat-computing.org/dataexpo/2009/the-data.htmldf <- read_csv("Downloads/2008.csv")print(dim(df))# [1] 7009728      29benchmark(  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},  replications = 10,  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))# test replications elapsed relative user.self sys.self# 2      grepl           10  16.480    1.513    16.195    0.248# 1 str_detect           10  10.891    1.000     9.594    1.281


This answer similar to others, but using preferred stringr::str_detect and dplyr rownames_to_column.

library(tidyverse)mtcars %>%   rownames_to_column("type") %>%   filter(stringr::str_detect(type, 'Toyota|Mazda') )#>             type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb#> 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4#> 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4#> 3 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1#> 4  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Created on 2018-06-26 by the reprex package (v0.2.0).