Subset string by counting specific characters


You can accomplish your task with a simple call to str_extract from the stringr package:

    library(stringr)

    strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
    str_extract(strings, '([^AGN]*[AGN]){3}')
    # [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern matches zero or more consecutive characters that are not A, G, or N, followed by exactly one A, G, or N. Wrapping it in parentheses and adding braces, as in ([^AGN]*[AGN]){3}, matches that unit three times in a row, so the match ends at the third A, G, or N. To change how many occurrences of A, G, or N you are looking for, change the integer in the curly braces:

    str_extract(strings, '([^AGN]*[AGN]){4}')
    # [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
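If the character set and the count both need to vary, the pattern can be built from parameters. The following is a minimal base R sketch, not part of the original answer: the helper name extract_up_to_nth and its arguments chars and n are hypothetical, and it assumes chars contains no characters that are special inside a bracket expression (such as ], ^, - or a backslash).

```r
# Hypothetical helper: return the prefix of each string that ends at the
# n-th occurrence of any character in `chars`, or NA if there are fewer.
extract_up_to_nth <- function(x, chars = "AGN", n = 3) {
  # Builds e.g. "([^AGN]*[AGN]){3}". Assumes `chars` has no characters that
  # are special inside a bracket expression.
  pattern <- sprintf("([^%s]*[%s]){%d}", chars, chars, n)
  m <- regexpr(pattern, x)
  # regmatches() only returns the matched elements, so fill the rest with NA
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
res3 <- extract_up_to_nth(strings, n = 3)
res4 <- extract_up_to_nth(strings, n = 4)
```

Filling with NA mirrors what str_extract returns for strings that do not contain enough matching characters.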

There are a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:

    m <- regexpr('([^AGN]*[AGN]){3}', strings)
    regmatches(strings, m)
    # [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

    sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
    # [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
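The two base R approaches behave differently from str_extract when a string has too few matching characters, which is worth checking before relying on either. The sketch below uses the same example strings with a count of 4, so that the second string does not match; the variable names dropped and unchanged are only for illustration.

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# regmatches() silently drops elements without a match, so the result
# can be shorter than the input.
m <- regexpr("([^AGN]*[AGN]){4}", strings)
dropped <- regmatches(strings, m)

# sub() leaves a non-matching element untouched, so the result keeps the
# input length but returns the full original string in that position.
unchanged <- sub("(([^AGN]*[AGN]){4}).*", "\\1", strings)
```

Here dropped has three elements, while unchanged has four and still contains "AABSDGDRY" in full as its second element.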


Here is a base R option using strsplit:

    sapply(strsplit(strings, ""), function(x)
      paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
    # [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse:

    library(tidyverse)

    map_chr(str_split(strings, ""),
      ~ str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
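One caveat with the which.max approach: on an all-FALSE vector which.max returns 1, so a string with fewer than three of the target characters would yield its first character rather than NA. A possible guarded variant in base R is sketched below; the function name prefix_by_count and its arguments are hypothetical and not part of the original answer.

```r
# Hypothetical guarded variant: split into characters, locate the target
# characters, and return NA when there are fewer than n of them.
prefix_by_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    pos <- which(ch %in% chars)          # positions of the target characters
    if (length(pos) < n) return(NA_character_)
    paste(ch[seq_len(pos[n])], collapse = "")
  }, character(1))
}

# "XYZA" has only one target character, so it should give NA
strings2 <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG", "XYZA")
guarded <- prefix_by_count(strings2)
```

Using vapply with character(1) instead of sapply also guarantees a character vector of the same length as the input.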


Identify the positions of the pattern with gregexpr, take the n-th position (here the 3rd), and extract everything from position 1 up to that position with substr.

    nChars <- 3
    pattern <- "A|G|N"
    # Use sapply to iterate over the strings vector
    sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))

PS: If a string has fewer than 3 matches, the result for that string is NA, so you just need to call na.omit on the final result.
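As a small illustration of that last step, the sketch below adds one extra string, "XYZA", which is an assumed example with only a single matching character, and then drops the resulting NA. The variable names res and clean are only for illustration.

```r
# "XYZA" is an added example string with fewer than 3 matches
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG", "XYZA")
nChars <- 3
pattern <- "A|G|N"

# Indexing past the last match position gives NA, and substr() with an
# NA stop position returns NA for that string.
res <- sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))

# na.omit() drops the NA entries (and records them in an attribute)
clean <- na.omit(res)
```

Note that sapply names the result by the input strings; wrapping the result in as.vector gives a plain unnamed character vector if the names and the na.action attribute are not wanted.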