Filter rows which contain a certain string
The answer to the question was already posted by the @latemail in the comments above. You can use regular expressions for the second and subsequent arguments of filter
like this:
dplyr::filter(df, !grepl("RTB",TrackingPixel))
Since you have not provided the original data, I will add a toy example using the mtcars
data set. Imagine you are only interested in cars produced by Mazda or Toyota.
mtcars$type <- rownames(mtcars)dplyr::filter(mtcars, grepl('Toyota|Mazda', type)) mpg cyl disp hp drat wt qsec vs am gear carb type1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX42 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter
command looks like this:
dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))
Solution
It is possible to use str_detect
of the stringr
package included in the tidyverse
package. str_detect
returns True
or False
as to whether the specified vector contains some specific string. It is possible to filter using this boolean value. See Introduction to stringr for details about stringr
package.
library(tidyverse)# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─# ✔ ggplot2 2.2.1 ✔ purrr 0.2.4# ✔ tibble 1.4.2 ✔ dplyr 0.7.4# ✔ tidyr 0.7.2 ✔ stringr 1.2.0# ✔ readr 1.1.1 ✔ forcats 0.3.0# ─ Conflicts ───────────────────── tidyverse_conflicts() ─# ✖ dplyr::filter() masks stats::filter()# ✖ dplyr::lag() masks stats::lag()mtcars$type <- rownames(mtcars)mtcars %>% filter(str_detect(type, 'Toyota|Mazda'))# mpg cyl disp hp drat wt qsec vs am gear carb type# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag# 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla# 4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
The good things about Stringr
We should use rather stringr::str_detect()
than base::grepl()
. This is because there are the following reasons.
- The functions provided by the
stringr
package start with the prefixstr_
, which makes the code easier to read. - The first argument of the functions of
stringr
package is always the data.frame (or value), then comes the parameters.(Thank you Paolo)
object <- "stringr"# The functions with the same prefix `str_`.# The first argument is an object.stringr::str_count(object) # -> 7stringr::str_sub(object, 1, 3) # -> "str"stringr::str_detect(object, "str") # -> TRUEstringr::str_replace(object, "str", "") # -> "ingr"# The function names without common points.# The position of the argument of the object also does not match.base::nchar(object) # -> 7base::substr(object, 1, 3) # -> "str"base::grepl("str", object) # -> TRUEbase::sub("str", "", object) # -> "ingr"
Benchmark
The results of the benchmark test are as follows. For large dataframe, str_detect
is faster.
library(rbenchmark)library(tidyverse)# The data. Data expo 09. ASA Statistics Computing and Graphics # http://stat-computing.org/dataexpo/2009/the-data.htmldf <- read_csv("Downloads/2008.csv")print(dim(df))# [1] 7009728 29benchmark( "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))}, "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))}, replications = 10, columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))# test replications elapsed relative user.self sys.self# 2 grepl 10 16.480 1.513 16.195 0.248# 1 str_detect 10 10.891 1.000 9.594 1.281
This answer similar to others, but using preferred stringr::str_detect
and dplyr rownames_to_column
.
library(tidyverse)mtcars %>% rownames_to_column("type") %>% filter(stringr::str_detect(type, 'Toyota|Mazda') )#> type mpg cyl disp hp drat wt qsec vs am gear carb#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4#> 3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1#> 4 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Created on 2018-06-26 by the reprex package (v0.2.0).