"vertical" regex matching in an ASCII "image" "vertical" regex matching in an ASCII "image" php php

"vertical" regex matching in an ASCII "image"


Answer to question 1

To answer the first question one could use:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line^                        # beginning of line(?:    .                    # any character except \n    (?=                  # lookahead        .*+\n            # go to next line        ( \1?+ . )       # add a character to the 1st capturing group        .*+\n            # next line        ( \2?+ . )       # add a character to the 2nd capturing group    ))*?                      # repeat as few times as neededX .*+\n                  # X on the first line and advance to next line\1?+                     # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first lineX .*+\n                  # X on the 2nd line and advance to next line\2?+                     # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first lineX                        # X on the 3rd line

Online demo

This expression works in Perl, PCRE, Java and should work in .NET.

The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").

\1?+ means if \1 matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) \1 ). Which means match \1 if \1 is defined.

polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)

Answer to question 2

Plain matching

When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:

It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).

Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.

(Semi?)proof

Assume that no variable length lookbehinds are available.

You have to in some way count the number of characters on a line before an X.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.

Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.

Length/indirect solution

On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).

This solution is based on/inspired by m.buettner's nice "partial PCRE solution".

One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.

^(?:    (?:                   # match .+? characters        .        (?=               # counting the same number on the following two lines            .*+\n            ( \1?+ . )            .*+\n            ( \2?+ . )        )    )+?    (?<= X )              # till the above consumes an X    (?=                   # that matches the following conditions        .*+\n        \1?+        (?<= X )        .*+\n        \2?+        (?<= X )    )    (?=                   # count the number of matches        .*+\n        ( \3?+ . )        # the number of matches = length of $3    ))*                        # repeat as long as there are matches on this line.*\n?                     # remove the rest of the line

Which in Perl could be written as:

$in =~ s/regex/$3/gmx;$count = length $in;

Online demo

This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.

Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.

Test cases

Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).

Test #0:--------------------XXXresult: 1 (X)Test #1:--------------------..X......X......X....result: 1 (.)Test #2:--------------------..X.X....X.X......X..result: 1 (.)Test #3:--------------------..X......X.......X...result: 0 ()Test #4:--------------------..X.......X.....X....result: 0 ()Test #5:--------------------....X...X..X...X.....result: 0 ()Test #6:--------------------.X..X...X.X....X.X...result: 1 (.)Test #7:--------------------.X..X...X..X...X..X..result: 2 (.X)Test #8:--------------------XXXXXXXXXresult: 3 (XXX)Test #9:--------------------X.X.XXXXXXXXXXX.X.X.result: 5 (XXXXX)Test #10:--------------------1....X.......2..X..X...X....3X.X...X..X.....4X....XXXXXX.....5X..XXX...........6.....X..........7.........X....X8..X......X....X....9..X......X....X....X...A....X.....B.X..X..C.....XXXXXXXXX.result: 8 (3458.XXX)


Edit

The following solutions have two grave problems:

  1. They can't match multiple XXX sequences starting on the same line, as the pos advances too much.
  2. The second solution is incorrect: it matches consecutive lines where two X are above each other. There don't neccessarily have to be three in a row.

Therefore, all upvotes (and the bounty) should go to either of m.buettner's comprehensive .NET answer or the fascinating PCRE answer from Qtax himself.


Original Answer

This is an answer using embedding of Perl code into regexes. Because a Perl regex can use code to assert arbitrary conditions inside regexes or emit partial regexes, they are not limited to matching regular languages or context-free languages, but can match some parts of languages higher up in the Chomsky hierarchy.

The language you want to match can be described in regex terms as

^ .{n} X .*\n  .{n} X .*\n  .{n} X

where n is a number. This is about as complex as matching the anbncn language which is the canonical example for a context-sensitive language.

We can match the first line easily, and use some Perl code to emit the regex for the other lines:

    /^ (.*?) X       (?: .*\n (??{"." x length($1)}) X){2}    /mx

That was short! What does it do?

  • ^ (.*?) X anchores at the start of a line, matches as few non-newline characters as possible and then the X. We remember the line up to the X as capture group $1.

  • We repeat a group two times which matches the rest of the line, a newline, and then injects a regex that matches a string of the same length as $1. After that, there must be an X.

This regex will now match every string that has three X on top of each other.

If we want to extract all such sequences, we'll have to be nifty. Because sequences may overlap, e.g.

.XXXXXX.

the position where the next match starts must not proceed past the first X. We can do this via a lookbehind and lookahead. Perl only supports constant-length lookbehind, but has the \K escape which provides similar semantics. Thus

/^ (.*?) \K X   (?=( (?: .*\n (??{"."x length($1)}) X ){2} ))/gmx

will match every sequence of three vertical Xes. Testing time:

$ perl -E'my$_=join"",<>; say "===\n$1X$2" while /^(.*?)\KX(?=((?:.*\n(??{"."x length($1)})X){2}))/gmx' <<'END'....X.........X..X...X....X.X...X..X.....X....XXXXXX.....X..XXX................X........................X..X...........X......X...........X....X.......X.....END===..X..X...X....X.X...X..X.....X....XXXXX===X.X...X..X.....X....XXXXXX.....X===X....XXXXXX.....X..XXX................X===..............X..X...........X......X...........X

Note: this relies on experimental regex features that are available from at least Perl 5, v10 onward. The code was tested with a v16 perl.


Solution without embedded code

Let us look at the lines

...X...\n...X..\n

We want to assert that the leading ... part of each line is of same length. We can do so by recursion with base case X.*\n:

(X.*\n|.(?-1).)X

If we anchor that at the start of a line, we can match two vertical Xes. To match more than two lines, we have to do the recursion in a lookahead and then advance the match position to the next line, and repeat. For this, we simply match .*\n.

This results in the following regex which can match a string with three vertical Xes:

/ ^  (?:    (?=( X.*\n | .(?-1). ) X)    .*\n # go to next line  ){2}/mx

But this isn't good enough, as we want to match all such sequences. To do this, we essentially put the whole regex into a lookahead. The regex engine makes sure to advance the position every time to produce a new match.

/ ^  (?=    (      (?:          (?= (X.*\n | .(?-1). ) X)          .*\n # go to next line      ){2}      .* # include next line in $1    )  )/mx

Testing time:

$ perl -E'my$_=join"",<>; say "===\n$1" while /^(?=((?:(?=(X.*\n|.(?-1).)X).*\n){2}.*))/gmx' <<'END'....X.........X..X...X....X.X...X..X.....X....XXXXXX.....X..XXX................X........................X..X...........X......X...........X....X.......X.....END===..X..X...X....X.X...X..X.....X....XXXXXX.....===X.X...X..X.....X....XXXXXX.....X..XXX...........===X....XXXXXX.....X..XXX................X..........===..............X..X...........X......X...........X....X...

So this works as well as the solution with embedded code, that is, it matches each group of lines with vertical Xes, not each group of Xes. (Actually, this solution seems more fragile to me than embedded code)


First of all: brilliant question. I think it can be very instructive to try to take regex engines to their limits.

The basic .NET solution

You guys said in the comments that it would be easy with .NET, but since there is no answer for that yet, I thought I'd write one.

You can solve both question 1. and 2. using .NET's variable-length lookbehind and balancing groups. Most of the work is done by the balancing groups, but the variable-length lookbehind is crucial to be able to detected multiple matches starting on the same line.

Anyway, here is the pattern:

(?<=                  # lookbehind counts position of X into stack  ^(?:(?<a>).)*       # push an empty capture on the 'a' stack for each character                      # in front of X)                     # end of lookbehindX                     # match X(?=.*\n               # lookahead checks that there are two more Xs right below  (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an                      # element onto 'b' and consume a character  (?(a)(?!))          # make sure that stack 'a' is empty  X.*\n               # match X and the rest of the line  (?:(?<-b>).)*       # while we can pop an element from stack 'b', and consume                      # a character  (?(b)(?!))          # make sure that stack 'b' is empty  X                   # match a final X)                     # end of lookahead

This pattern has to be used with RegexOptions.Multiline for the ^ to match the beginnings of lines (and obviously with RegexOptions.IgnorePatternWhitespace for freespacing mode to work).

Here are some additional comments:

By putting everything except the initial X into lookarounds, we have no problems with overlapping matches or even matches starting on the same line. However, the lookbehind has to be of variable-length which certainly constrains any solution of this kind to .NET.

The rest relies on a good grasp on balancing groups. I won't go into this in detail here, because it makes for quite long answers in itself. (see MSDN and this blog post for even more information)

The lookbehind just matches ^.*, so everything until the start of the line, but for every . we push an empty capture onto stack a, thereby counting the position of our X as the size of the stack.

Then after consuming the rest of the line in the lookahead, we match again just .*, but before consuming each ., we pop one element from stack a (which leads to failure, once a is empty) and push an empty capture onto b (so that we don't forget how many characters there have to be for the third line).

To make sure that we really empty the entire stack, we use (?(a)(?!)). This is a conditional pattern, that tries to match (?!) if stack a is not empty (and is simply skipped otherwise). And (?!) is an empty negative lookahead, which always fails. Hence, this simply encodes, "is a not empty? fail. otherwise, continue".

Now that know we've consumed exactly the right amount of characters in the new line, we try to match a X and the rest of the line. Then we repeat the same process again with stack b. Now there is no need to push onto any new stack, because if this works, we're done. We check that b is empty after this, and match a third X.

Finally, an optimization side note: this pattern still works if all repetitions are wrapped in atomic groups (thereby emulating possessive quantifiers, which are not supported by .NET)! This would save a lot of backtracking. Moreover, if we put at least the stack-popping quantifiers in atomic groups, we can get rid of both (?(...)(?!)) checks (as these are only needed for cases, where the preceding repetition had to backtrack).

The full .NET solution

(Only the bravest of adventurers should follow me into the really dark cave I'm about to descend into...)

As discussed in the comments, this solution has one drawback: it counts overlapping matches. E.g.

..X....X....X....X..

Gives two matches, one in the first and one in the second line. We'd like to avoid this, and report only one match (or two if there are 6 to 8 Xs and three if there are 9 to 11 Xs and so on). Moreover, we want to report the matches at the 1st, 4th, 7th, ... X.

We can adjust the above pattern to allow for this solution by requiring that the first X is preceded by an integer multiple of 3 other Xs that statisfy our requirements. The basic idea of checking this uses the same stack manipulation as before (except we shift things around between 3 stacks so that after finding three Xs we end up where we started). To do this we have to fiddle a bit with the lookbehind.

There is a catch though. .NET's variable-length lookbehind uses another .NET-unique feature, RightToLeftMode, in which the pattern is read (and matched) from right to left. Normally this doesn't need to bother us, but when we combine this with balancing groups, we might be in for some unpleasant surprises. In particular, when considering how our capture stacks evolve, we need to construct (and read) the expression from right to left (or bottom to top) as well.

So when you read the following expression (and my annotations), start at the end of the outermost lookbehind (you'll have to scroll a bit) - i.e. just before the only top-level X; then read all the way up to the top. And then continue after the lookbehind.

(?<=                    # note that the lookbehind below does NOT affect the state of stack 'a'!  # in fact, negative lookarounds can never change any capturing state.  # this is because they have to fail for the engine to continue matching.  # and if they fail, the engine needs to backtrack out of them, in which  # case the previous capturing state will be restored.  (?<!                # if we get here, there is another X on top of the last                      # one in the loop, and the pattern fails    ^                 # make sure we reached the beginning of the line    (?(a)(?!))        # make sure that stack 'a' is empty    (?:(?<-a>).)*     # while we can pop an element from stack 'a', and consume                      # a character    X.*\n             # consume the next line and a potential X  )  # at this point we know that there are less than 3 Xs in the same column  # above this position. but there might still be one or two more. these  # are the cases we now have to eliminate, and we use a nested negative  # lookbehind for this. the lookbehind simply checks the next row and  # asserts that there is no further X in the same column.  # this, together with the loop, below means that the X we are going to match  # is either the topmost in its column or preceded by an integer multiple of 3  # Xs - exactly what we are looking for.  (?:    # at this point we've advanced the lookbehind's "cursor" by exactly 3 Xs    # in the same column, AND we've restored the same amount of captures on    # stack 'a', so we're left in exactly the same state as before and can    # potentially match another 3 Xs upwards this way.    # the fact that stack 'a' is unaffected by a full iteration of this loop is    # also crucial for the later (lookahead) part to work regardless of the    # amount of Xs we've looked at here.    ^                 # make sure we reached the beginning of the line    (?(c)(?!))        # make sure that stack 'a' is empty    (?:(?<-c>)(?<a>).)* # while we can pop an element from stack 'c', push an                      # element onto 'a' and consume a character    X.*\n             # consume the next line and a potential X    (?(b)(?!))        # make sure that stack 'b' is empty    (?:(?<-b>)(?<c>).)* # while we can pop an element from stack 'b', push an                      # element onto 'c' and consume a character    X.*\n             # consume the next line and a potential X    (?(a)(?!))        # make sure that stack 'a' is empty    (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an                      # element onto 'b' and consume a character    X.*\n             # consume the next line and a potential X  )*                  # this non-capturing group will match exactly 3 leading                      # Xs in the same column. we repeat this group 0 or more                      # times to match an integer-multiple of 3 occurrences.  ^                   # make sure we reached the beginning of the line  (?:(?<a>).)*        # push an empty capture on the 'a' stack for each                      # character in front of X)                     # end of lookbehind (or rather beginning)# the rest is the same as before    X                     # match X(?=.*\n               # lookahead checks that there are two more Xs right below  (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an                      # element onto 'b' and consume a character  (?(a)(?!))          # make sure that stack 'a' is empty  X.*\n               # match X and the rest of the line  (?:(?<-b>).)*       # while we can pop an element from stack 'b', and consume                      # a character  (?(b)(?!))          # make sure that stack 'b' is empty  X                   # match a final X)                     # end of lookahead

Working demo on RegexHero.net.

I interspersed all explanation right with the pattern this time. So if you read the pattern in the way I recommended above, you get the explanation right when you need it...

Now that was one hell of a beast. But it satisfies the entire specification now and shows off just how powerful .NET's regex flavor really is. And, although this seems quite horrible, I think (once you realise the right-to-left-thing) this is much more easily understandable than a comparable solution with PCRE (using recursion or otherwise).

As Kobi mentioned in a comment below, this could be shortened a good bit, if you accept that your results are found in multiple captures of a single match (e.g., if you have a column of 7 Xs you only get one match, but with 2 captures in a certain group). You can do this by repeating the main (lookahead) part 1 or more times and capturing the initial X (put everything in a lookahead though). Then the lookbehind does not need to count off triples of Xs, but only has to check that there is no leading X. This would probably cut the size of the pattern in half.

The partial PCRE solution

(If only the bravest of adventurers followed me through the last solution, I am probably only left with madmen on the next journey...)

To prove what I just said about how the above solution compares to PCRE, let's look at how we can even remotely solve the full problem in PCRE. We'll have to work a good bit harder without variable-length lookbehinds and balancing groups.

Qtax (the OP) provided a brilliant solution to his first question (checking whether the string contains any X-column) using self-referencing groups to count. This is a very elegant and compact solution. But because each match goes from the beginning of the line to the X that starts the column, and matches cannot overlap, we can't get multiple matches per line. We could try to put everything in a lookahead (so that nothing is actually matched), but two zero-width matches will also never start at the same position - so we'll still get only one match per candidate line.

However it is indeed possible to solve at least the first part of question 2 with PCRE: count the number of columns starting in each line (and hence to total amount of X columns). Since we cannot get this count in the form of individual matches (see previous paragraph), and we cannot get this count in the form of individual groups or captures (since PCRE provides only a fixed and finite number of captures, as opposed to .NET). What we can do instead is to encode the number of columns in the matches.

Here is how: for each line we check if there's a column starting. If so, we include one character in a certain capturing group. Then, before reporting a successful match, we try to find as many further columns as possible - for each one adding a character to that particular group. By doing this, we encode the number of columns starting in each line in the length of that particular capture.

Actually realizing this concept in a regex is a lot more complicated than it may first sound (and it already sounds quite complicated). Anyway, here it is:

^                        (?:(?|  (?(5)(?![\s\S]*+\5))        (?!(?!)()())   (?=    (?:      .                        (?=                        .*+\n                    ( \3? . )           .*+\n                ( \4? . )          )    )*?                  X .*+\n              \3                   X .*+\n              \4                 )  ()|  (?(5)(?=[\s\S]*+\5)|(?!))  (?:    .    (?=      .*+\n      ( \1? .)      .*+\n      ( \2? .)    )  )+?  (?=    (?<=X).*+\n    (\1)             (?<=X).*+\n    (\2)             (?<=X)       )  (?=    ([\s\S])       [\s\S]*    ([\s\S] (?(6)\6))  )){2})+

(Actually, it's a bit easier than that - see Qtax's answer for how to simplify this approach. I'll leave this approach here anyway for academic reasons, as some very advanced and interesting techniques can be learned from it - see the summary at the end.)

Yes, there are no annotations. I figured, no one would actually read them anyway, so instead I'll try to break this expression down in parts (I'll go for a top-down approach).

So let's look at the outer layer of onion from hell:

^                        (?:(?|  checkForNextColumn|  countAndAdvance){2})+

So our matches are again anchored to the beginnings of lines. Then we have a (?:...{2})+ which means an even number of repetitions of something. And that something is an alternation of two subpatterns. These subpatterns represent the steps I mentioned above. The first one checks that there is another column starting in the current line, the second one registers a count and prepares the engine's state for another application of the first subpattern. So control is given to the second pattern - the first just checks for another column using a lookahead and is hence a zero-width pattern. This is why I cannot simply wrap everything in + but have to do the {2})+ thing - otherwise the zero-width component would be tried only once; that's a necessary optimization applied by pretty much all engines to avoid infinite loops with patterns like (a*)+.

There is one more (very important detail): I used (?|...) for the alternation. In this kind of grouping, each alternative starts with the same group number. Hence in /(?|(a)|(b))/ both a and b can be captured into group 1. This is the crucial trick that allows "communication" between subpatterns, as they can modify the same groups.

Anyway... so we have these two subpatterns. We'd like to make sure that control really alternates between them. So that each group fails if it was the last one that matched. We do this by wrapping the pattern in some grouping-and-referencing magic:

^(?:(?|  (?(5)(?![\s\S]*+\5))       # if group 5 has matched before make sure that                             # it didn't match empty  checkForNextColumn         # contains 4 capturing groups  ()                         # this is group 5, match empty|  (?(5)(?=[\s\S]*+\5)|(?!))  # make sure that group 5 is defined and that it                             # matched empty  advanceEngineState         # contains 4 capturing groups  (?=    ([\s\S])                 # this is group 5, match non-empty    [\s\S]*                  # advance to the end very end of the string    ([\s\S] (?(6)\6))             # add a character from the end of the string to                             # group 6  )){2})+

So at the end of each alternative, we'll invalidate the condition for this alternative to even start matching. At the end of the second alternative we'll also include a character into group 6, using the technique outlined by Qtax. This is the counting step. I.e., group 6 will contain as many characters as there are columns starting in the current line.

Now checkForNextColumn will really just be Qtax's solution inside a lookahead. It needs one more modification though and to justify this we'll look into advanceEngineState first.

Let's think about how we would want to modify the state, for Qtax's solution to match a second column in a row. Say we have input

..X..X....X..X....X..X..

and we want to find the second column. This could be accomplished, by starting the match from the position just after the first X and having groups \1 and \2 already initialised to the first three characters (..X) of rows 2 and 3, respectively (instead of them being empty).

Now let's try to do this: match everything up to and including the next X that starts a column, then fill two groups with the corresponding line-prefixes for use in the checkForNextColumn pattern. This is again pretty much Qtax's pattern, except that we count the X in (instead of stopping right before it), and that we need to add the capturing into a separate group. So here is advanceEngineState:

(?:  .  (?=    .*+\n    ( \1? .)    .*+\n    ( \2? .)  ))+?(?=  (?<=X) .*+\n  (\1)          (?<=X) .*+\n  (\2)          (?<=X))

Note how I turned the Xs into lookbehinds, to go one character further, and how I effectively copy the final contents of \1 into \3 and those of \2 into \4.

So if we now use Qtax's solution as checkForNextColumn in a lookahead, using groups \3 and \4 instead of \1 and \2, we should be done.

But just how do we make those groups \3 and \4 instead of \1 and \2? We could start the pattern with ()(), which would always match, without affecting the engine's cursor, but increase the group count by 2. However, this is problematic: this resets groups 1 and 2 to empty strings, which if we find a second column, advanceEngineState will be in an inconsistent state (as the engine's global position has been advanced, but the counting groups are zero again). So we want to get those two groups into the pattern, but without affecting what they are currently capturing. We can do this by utilizing something I already mentioned with the .NET solutions: groups in negative lookarounds do not affect the captured contents (because the engine needs to backtrack out of the lookaround to proceed). Hence we can use (?!(?!)()()) (a negative lookahead that can never cause the pattern to fail) to include two sets of parentheses in our pattern, that are never used. This allows us to work with groups 3 and 4 in our first subpattern, while keeping groups 1 and 2 untouched for the second subpatterns next iteration. In conclusion this is checkForNextColumn:

(?!(?!)()()) (?=  (?:    .                      (?=                      .*+\n                  ( \3? . )         .*+\n              ( \4? . )        )  )*?                X .*+\n            \3                 X .*+\n            \4               )

Which, for the most part actually looks really familiar.

So this is it. Running this against some input will give us a group 6 which contains one capture for each line that has a column starting - and the capture's length will tell us how many columns started there.

Yes, it really works (live demo).

Note that this (like the basic .NET solution) will overcount columns that are more than 3 Xs long. I suppose it is possible to correct this count with lookaheads (in a similar way to the lookbehind of the full .NET solution), but this is left as an exercise to the reader.

It's a bit unfortunate that the base problem of this solution is already very convoluted and bloats the solution (75% of the lines are mostly just copies of Qtax's solution). Because the surrounding framework has some really interesting techniques and lessons:

  • We can have multiple subpatterns that accomplish specific matching/counting tasks, and have them "communicate" through mutual capturing groups, by putting them in a (?|...) alternation and looping over them.
  • We can force zero-width alternatives to be carried out over and over again by wrapping them in a finite quantifier like {2} before putting everything into +.
  • Group numbers can be skipped in one subpattern (without affecting the captured contents) by putting them into a never-failing negative lookahead like (?!(?!)()).
  • Control can be passed back and forth between subpatterns by capturing something or nothing in a certain group that is checked upon entering the alternation.

This allows for some very powerful computations (I've seen claims that PCRE is in fact Turing-complete) - although this is certainly the wrong approach for productive use. But still trying to understand (and come up) with such solutions can be a very challenging and somehow rewarding exercise in problem solving.