Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring


This can be done programmatically in Perl, or any other language.

Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. The bookkeeping has to be done in the program surrounding your matches. The matching itself should still be done with regex, as that's what regex is meant for.
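As a minimal sketch of what "in the program surrounding your matches" means, the snippet below captures part of one string and interpolates its complement into a second pattern that is then applied to a different string. The two input strings and all variable names are made up for illustration and are not taken from the question's files.

```perl
use strict;
use warnings;

# Hypothetical inputs standing in for one line from each file.
my $line_from_file1 = 'AACGGTTCTGTGTA';
my $line_from_file2 = 'TTTGGGCCCTT';

my $second_pattern;
if ( $line_from_file1 =~ /AAC([AG]{2,5})/ ) {
    my $captured = $1;    # what the first regex matched
    ( my $complemented = $captured ) =~ tr/ACGT/TGCA/;

    # The program, not the regex engine, carries the result over
    # into the pattern used on the second string.
    $second_pattern = qr/TTTGGG\Q$complemented\E/;
}

my $found = defined($second_pattern) && $line_from_file2 =~ $second_pattern;
print $found ? "second pattern matched\n" : "no match\n";
```

The only state shared between the two matches is the ordinary Perl variable holding the capture, which is the point of the paragraph above.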

You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.

Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.

The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.

    use strict;
    use warnings;

    sub complement {
        my $string = shift;
        $string =~ tr/ATGC/TACG/;    # this is a transliteration, faster than s///
        return $string;
    }

    # first regex, split into sub-patterns
    my @first = (
        qr([ACGT]{1,12000}),
        qr(AAC),
        qr([AG]{2,5}),
        qr([ACGT]{2,5}),
        qr(CTGTGTA),
    );

    # second regex, split into sub-patterns as callbacks
    my @second = (
        sub { return qr(CTAAA) },
        sub { return qr([AC]{5,100}) },
        sub { return qr(TTTGGG) },
        sub {
            my (@matches) = @_;

            # complement the pattern of first.regex.p3
            return complement( $matches[3] );
        },
        sub { return qr(CTT) },
        sub { return qr([AG]{10,5000}) },
        sub {
            my (@matches) = @_;

            # complement the pattern of first.regex.p4
            return complement( $matches[4] );
        },
    );

    my $file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG";

    while ( my $file1 = <DATA> ) {

        # this pattern will match the full thing in $1, and each sub-section in $2, $3, ...
        # @matches will contain (full, $2, $3, $4, $5, $6)
        my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );

        # iterate the list of anonymous functions and call each of them,
        # passing in the match results of the first match
        my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;

        my @matches2 = ( $file2 =~ m/($pattern2)/ );
    }

    __DATA__
    AAACCCGTGTAATAACAGACGTACTGTGTA
    TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
    TAACAAGGACCCTGTGTA

These are the generated second patterns for your three input substrings.

    ((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TCT)((?^:CTT))((?^:[AG]{10,5000}))(GCAT)
    ((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(CC)((?^:CTT))((?^:[AG]{10,5000}))(AA)
    ((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TTCCT)((?^:CTT))((?^:[AG]{10,5000}))(GG)

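If you're not familiar with the (?^:...) notation, it's what you get when you print a pattern that was constructed with the quote-regex operator qr//. The groups without it, such as (TCT), come from the callbacks that return a plain complemented string.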
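A small sketch of that stringification, assuming Perl 5.14 or later (older versions print the default flags differently). The variable names are only for illustration.

```perl
use strict;
use warnings;

my $fixed   = qr(CTAAA);    # a compiled pattern object
my $literal = 'TCT';        # a plain string, e.g. a complemented capture

# Interpolating a qr// object into a string yields its (?^:...) form;
# a plain string is inserted unchanged.
my $combined = "($fixed)($literal)";
print "$combined\n";        # on Perl 5.14 or later: ((?^:CTAAA))(TCT)
```

The (?^:...) wrapper is a non-capturing group that resets the pattern to the default flags, so embedding it in a larger pattern does not change what the sub-pattern matches.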

The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.

    [
        [0] "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG",
        [1] "CTAAA",
        [2] "ACACC",
        [3] "TTTGGG",
        [4] "TTCCT",
        [5] "CTT",
        [6] "AAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAG",
        [7] "GG"
    ]

I cannot say anything about the speed of this implementation, but I believe it will be reasonably fast.

If you wanted to find other combinations of patterns, all you would have to do is replace the entries in those two arrays: the qr// patterns in @first and the sub { ... } callbacks in @second. If the first match has a number of sub-patterns other than five, you would also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like, keeping the outer group so that the indexes in @matches stay the same.

    my $pattern1 = join q{}, map { "($_)" } @first;
    my @matches  = ( $file1 =~ m/($pattern1)/g );
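Putting both steps together, here is a sketch of how the whole thing could be wrapped in one reusable function. The name build_second_pattern is hypothetical, and the callbacks are written in a shorter style than above, but the patterns and the complement logic are the same.

```perl
use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/;
    return $string;
}

# Hypothetical helper: match $input against the sub-patterns in @$first and,
# on success, build the second pattern by calling each callback in @$second.
sub build_second_pattern {
    my ( $input, $first, $second ) = @_;
    my $pattern1 = join q{}, map { "($_)" } @$first;
    my @matches  = ( $input =~ m/($pattern1)/ );
    return unless @matches;    # the first pattern did not match
    return join q{}, map { '(' . $_->(@matches) . ')' } @$second;
}

my @first = (
    qr([ACGT]{1,12000}), qr(AAC), qr([AG]{2,5}), qr([ACGT]{2,5}), qr(CTGTGTA),
);
my @second = (
    sub { qr(CTAAA) },
    sub { qr([AC]{5,100}) },
    sub { qr(TTTGGG) },
    sub { complement( $_[3] ) },    # $_[0] is the full match, so $_[3] is p3
    sub { qr(CTT) },
    sub { qr([AG]{10,5000}) },
    sub { complement( $_[4] ) },    # p4
);

my $pattern2 = build_second_pattern( 'TAACAAGGACCCTGTGTA', \@first, \@second );
print "$pattern2\n";
```

Because the function only sees two array references, a different pattern combination needs different array contents but no change to the function body.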

If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF on the book's website.


Using stringr in R

Extract matches to regex_1: "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)"

    reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
    reg_1_matches = unlist(reg_1_matches)

Let's assume the matches were:

    reg_1_matches = c("TTTTTTTGCGACCGAGAAACGGTTCTGTGTA", "TAACAAGGACCCTGTGTA")

Use stringr::str_match with capturing groups (...)

    df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
    p3 = df_ps[,2]
    p4 = df_ps[,3]

Complement

    rule_1 = chartr(old = "ACGT", "TGCA", p3)
    rule_2 = chartr(old = "ACGT", "TGCA", p4)

Construct regex_2

    paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")

All in one go:

    reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
    reg_1_matches = unlist(reg_1_matches)  # str_extract_all returns a list; str_match needs a character vector
    df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
    p3 = df_ps[,2]
    p4 = df_ps[,3]
    rule_1 = chartr(old = "ACGT", "TGCA", p3)
    rule_2 = chartr(old = "ACGT", "TGCA", p4)
    paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")


This question really brings to mind the old saying about regular expressions, though in this case the languages you're matching against are regular, so RE is a good fit for this.

Unfortunately, my Perl is somewhat lacking, but fundamentally this sounds like a Regex problem rather than an R or Perl problem, so I'll do my best to answer it on that basis.

Perl's regex engine supports capture groups. The substrings matching bracketed subexpressions in your regex can be made available after matching:

    use feature qw(say);

    $foo = 'foo';
    'aaa' =~ /(a)(a+)/;
    say($1); # => 'a'
    say($2); # => 'aa'
    say("Matched!") if 'aaaa' =~ /${2}/;

What I'd suggest doing is bracketing your regex up properly, picking apart the capture groups after matching, and then sticking them together into a new regex, say...

    use feature qw(say);

    # Note that I've added a lot of (s and )s here so that the results get sorted into nice groups
    'ACGTAACAGAGATCTGTGTA' =~ /([ACGT]{1,12000})(AAC)([AG]{2,5})([ACGT]{2,5})(CTGTGTA)/;
    say($1); # => 'ACGT'
    say($2); # => 'AAC'
    say($3); # => 'AGAG'
    say($4); # => 'AT'
    say($5); # => 'CTGTGTA'

    $complemented_3 = complement($3); # You can probably implement complement() yourself...
    $complemented_4 = complement($4);

    # qr// builds a pattern object; a bare /.../ here would match against $_ instead
    $new_regex = qr/${complemented_3}[ACGT]+${complemented_4}/;

If the sections have actual meaning, then I'd also advise looking up named capture groups, and giving the results decent names rather than $1, $2, $3....
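For illustration, here is roughly what the same match could look like with named capture groups, assuming Perl 5.14 or later (named captures need 5.10, the tr///r form needs 5.14). The group names prefix, anchor, p3, p4 and tail are made up; use whatever names reflect the meaning of each section.

```perl
use strict;
use warnings;
use feature qw(say);

my $input = 'ACGTAACAGAGATCTGTGTA';

my %section;
if ( $input =~ /(?<prefix>[ACGT]{1,12000})(?<anchor>AAC)(?<p3>[AG]{2,5})(?<p4>[ACGT]{2,5})(?<tail>CTGTGTA)/ ) {
    %section = %+;    # copy the named captures before another match resets them
}

say $section{p3};     # AGAG
say $section{p4};     # AT

# Complement the two named sections and build the follow-up pattern.
my $complemented_3 = $section{p3} =~ tr/ACGT/TGCA/r;
my $complemented_4 = $section{p4} =~ tr/ACGT/TGCA/r;
my $new_regex      = qr/${complemented_3}[ACGT]+${complemented_4}/;
```

Copying %+ into an ordinary hash is a design choice here: %+ always reflects the most recent successful match, so a later regex would otherwise silently replace the values.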