How to write a recursive regex that matches nested parentheses?
This pattern works:
$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
The content inside the parentheses is simply described as:
"anything that is not a parenthesis OR a recursion (= another pair of parentheses)", repeated 0 or more times
If you want to catch all substrings inside parentheses, you must put this pattern inside a lookahead to obtain all overlapping results:
$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);
Note that I have added a capturing group and replaced (?R) with (?1):
(?R) -> refers to the whole pattern (you can write (?0) too)
(?1) -> refers to the first capturing group
What is this lookahead trick?
A subpattern inside a lookahead (or a lookbehind) doesn't consume any characters; it's only an assertion (a test). This lets the regex engine test the same substring several times, which is how overlapping results become possible.
If you display the whole-pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.
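To make the lookahead behaviour concrete, here is a minimal sketch that runs the pattern above against a sample string. The `$subject` value is made up for illustration.

```php
<?php
$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
$subject = 'a (b (c) d) e (f)';   // sample input, chosen for illustration

preg_match_all($pattern, $subject, $matches);

// Whole-pattern matches: one empty string per hit, since a lookahead consumes nothing
print_r($matches[0]);   // '', '', ''

// Group 1 holds the text seen by the lookahead, including the nested "(c)"
print_r($matches[1]);   // '(b (c) d)', '(c)', '(f)'
```

Without the lookahead, "(c)" would not be reported separately, because it lies inside the already-consumed match "(b (c) d)".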
Note: the recursive subpattern can be improved like this:
\( [^()]*+ (?: (?R) [^()]* )*+ \)
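The improved form follows the "unrolled loop" idea: instead of choosing between two alternatives on every iteration, it consumes a run of non-parenthesis characters, then repeatedly takes one nested group followed by another run. The sketch below, using a made-up `$subject`, only checks that both forms return the same matches on that input; it does not measure speed.

```php
<?php
$original = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
$unrolled = '~ \( [^()]*+ (?: (?R) [^()]* )*+ \) ~x';

// Sample input: two balanced groups, then an unbalanced "((h)"
$subject = 'a (b (c) d) e (f) g ((h)';

preg_match_all($original, $subject, $m1);
preg_match_all($unrolled, $subject, $m2);

var_dump($m1[0] === $m2[0]);   // bool(true)
print_r($m2[0]);               // '(b (c) d)', '(f)', '(h)'
```

The first "(" of "((h)" has no partner, so only the inner "(h)" is matched.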
When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters, which were { and }. So my approach was to make it more generic.
Here is a script to generate the regex pattern with your own variable left and right delimiters.
$delimiter_wrap  = '~';
$delimiter_left  = '{'; /* put YOUR left delimiter here. */
$delimiter_right = '}'; /* put YOUR right delimiter here. */

/* Escape the delimiters so they are treated as literal characters. */
$delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );

$pattern = $delimiter_wrap . $delimiter_left
         . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
         . $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
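As a usage sketch, the same generation logic can be wrapped in a function. The name `buildNestedPattern` and the sample `$subject` are hypothetical, introduced only for this example. Because both delimiters are placed inside a character class, this approach assumes single-character delimiters.

```php
<?php
// Hypothetical helper (name chosen for this sketch): builds the recursive
// pattern for a pair of single-character delimiters.
function buildNestedPattern(string $left, string $right, string $wrap = '~'): string
{
    // Escape the delimiters so they are matched literally
    $left  = preg_quote($left, $wrap);
    $right = preg_quote($right, $wrap);

    return $wrap . $left . '((?:[^' . $left . $right . ']++|(?R))*)' . $right . $wrap;
}

$pattern = buildNestedPattern('{', '}');
$subject = 'x {a {b} c} y {d}';   // sample input, chosen for illustration

preg_match_all($pattern, $subject, $matches);
print_r($matches[0]);   // '{a {b} c}', '{d}'   (with delimiters)
print_r($matches[1]);   // 'a {b} c', 'd'       (content only)
```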
The following code uses my Parser class (it's under CC-BY 3.0) and works on UTF-8 (thanks to my UTF8 class).
It works by using a recursive function to iterate over the string. The function calls itself each time it finds a (. It also detects mismatched pairs when it reaches the end of the string without finding the corresponding ).
Also, this code takes a $callback parameter you can use to process each piece it finds. The callback receives two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns replaces the matched piece in the contents of the string (these changes are visible to callbacks at higher levels).
Note: the code does not include type checks.
Non-recursive part:
function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
{
    //Create a new parser object
    $parser = new Parser($string);
    //Call the recursive part
    $result = ParseParenthesisFragment($parser, $callback);
    if ($result['close'])
    {
        return $result['contents'];
    }
    else
    {
        //UNEXPECTED END OF STRING
        // throw new Exception('UNEXPECTED END OF STRING');
        return false;
    }
}
Recursive part:
function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
{
    $contents = '';
    $level = 0;
    while (true)
    {
        $parenthesis = array('(', ')');
        // Jump to the first/next "(" or ")"
        $new = $parser->ConsumeUntil($parenthesis);
        $parser->Flush(); //<- Flush is just an optimization
        // Append what we got so far
        $contents .= $new;
        // Read the "(" or ")"
        $element = $parser->Consume($parenthesis);
        if ($element === '(') //If we found "("
        {
            //OPEN
            $result = ParseParenthesisFragment($parser, $callback);
            if ($result['close'])
            {
                // It was closed, all ok
                // Update the level of this iteration
                $newLevel = $result['level'] + 1;
                if ($newLevel > $level)
                {
                    $level = $newLevel;
                }
                // Call the callback with the nested fragment's own level,
                // so that 0 = deepest as described above
                $new = call_user_func
                (
                    $callback,
                    $result['contents'],
                    $result['level']
                );
                // Append what we got
                $contents .= $new;
            }
            else
            {
                //UNEXPECTED END OF STRING
                // Don't call the callback for mismatched parentheses,
                // just append and return
                return array
                (
                    'close' => false,
                    'contents' => $contents.$result['contents']
                );
            }
        }
        else if ($element === ')') //If we found a ")"
        {
            //CLOSE
            return array
            (
                'close' => true,
                'contents' => $contents,
                'level' => $level
            );
        }
        else
        {
            //END OF STRING
            // Neither "(" nor ")" could be read. The original test here read
            // $result['status'], which is never set, so it was always true.
            return array
            (
                'close' => false,
                'contents' => $contents
            );
        }
    }
}
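Since the Parser and UTF8 classes are not shown here, the following is a self-contained sketch of the same recursive idea using only built-in string functions. It is not the author's implementation: the names `parseFragment` and `parseParentheses`, the `$nested` flag, and the sample callback are assumptions made for this example. It scans by byte offset, which is safe for UTF-8 input only because "(" and ")" are single ASCII bytes. In this sketch a balanced top-level string succeeds, and both an unclosed "(" and a stray ")" yield false.

```php
<?php
// Hypothetical stand-in for the Parser-based code: walks the string by byte
// offset and recurses on every "(".
// Returns [contents, level] for the fragment, or null on a mismatched pair.
function parseFragment(string $s, int &$pos, callable $callback, bool $nested): ?array
{
    $contents = '';
    $level = 0; // 0 means this fragment contains no nested group

    while (true) {
        // Copy everything up to the next "(" or ")"
        $run = strcspn($s, '()', $pos);
        $contents .= substr($s, $pos, $run);
        $pos += $run;

        if ($pos >= strlen($s)) {
            // End of string: fine at top level, an unclosed "(" otherwise
            return $nested ? null : [$contents, $level];
        }

        if ($s[$pos++] === ')') {
            // A ")" closes a nested fragment; at top level it has no partner
            return $nested ? [$contents, $level] : null;
        }

        // Found "(": parse the group recursively
        $inner = parseFragment($s, $pos, $callback, true);
        if ($inner === null) {
            return null;
        }
        $level = max($level, $inner[1] + 1);
        // The callback's return value replaces the whole "(...)" group
        $contents .= $callback($inner[0], $inner[1]);
    }
}

// Returns the rewritten string, or false on mismatched parentheses.
function parseParentheses(string $s, callable $callback)
{
    $pos = 0;
    $result = parseFragment($s, $pos, $callback, false);
    return $result === null ? false : $result[0];
}

// Sample callback: tags each group with its level (0 = deepest)
$wrap = function (string $text, int $level): string {
    return "[$level:$text]";
};

echo parseParentheses('a (b (c) d) e', $wrap), "\n"; // a [1:b [0:c] d] e
var_dump(parseParentheses('a (b', $wrap));           // bool(false)
```

The inner group "(c)" is processed first, and its replacement "[0:c]" is already part of the text the outer callback receives.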