Coalescing regular expressions in PHP Coalescing regular expressions in PHP php php

Coalescing regular expressions in PHP


I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).

One weakness is it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.

// Pass as many expressions as you'd likefunction preg_magic_coalesce() {    $active_modifiers = array();    $expression = '/(?:';    $sub_expressions = array();    foreach(func_get_args() as $arg) {        // Determine modifiers from sub-expression        if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {            $modifiers = preg_split('//', $matches[3]);            if($modifiers[0] == '') {                array_shift($modifiers);            }            if($modifiers[(count($modifiers) - 1)] == '') {                array_pop($modifiers);            }            $cancel_modifiers = $active_modifiers;            foreach($cancel_modifiers as $key => $modifier) {                if(in_array($modifier, $modifiers)) {                    unset($cancel_modifiers[$key]);                }            }            $active_modifiers = $modifiers;        } elseif(preg_match('/(.)(.*)\1$/', $arg)) {            $cancel_modifiers = $active_modifiers;            $active_modifiers = array();        }        // If expression has modifiers, include them in sub-expression        $sub_modifier = '(?';        $sub_modifier .= implode('', $active_modifiers);        // Cancel modifiers from preceding sub-expression        if(count($cancel_modifiers) > 0) {            $sub_modifier .= '-' . implode('-', $cancel_modifiers);        }        $sub_modifier .= ')';        $sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);        // Properly escape slashes        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);        $sub_expressions[] = $sub_expression;    }    // Join expressions    $expression .= implode('|', $sub_expressions);    $expression .= ')/';    return $expression;}

Edit: I've rewritten this (because I'm OCD) and ended up with:

function preg_magic_coalesce($expressions = array(), $global_modifier = '') {    if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {        $global_modifier = '';    }    $expression = '/(?:';    $sub_expressions = array();    foreach($expressions as $sub_expression) {        $active_modifiers = array();        // Determine modifiers from sub-expression        if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {            $active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',                $matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);        }        // If expression has modifiers, include them in sub-expression        if(count($active_modifiers) > 0) {            $replacement = '(?';            $replacement .= implode('', $active_modifiers);            $replacement .= ':$2)';        } else {            $replacement = '$2';        }        $sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',            $replacement, $sub_expression);        // Properly escape slashes if another delimiter was used        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);        $sub_expressions[] = $sub_expression;    }    // Join expressions    $expression .= implode('|', $sub_expressions);    $expression .= ')/' . $global_modifier;    return $expression;}

It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression but I've noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).


  1. Strip delimiters and flags from each. This regex should do it:

    /^(.)(.*)\1([imsxeADSUXJu]*)$/
  2. Join expressions together. You'll need non-capturing parenthesis to inject flags:

    "(?$flags1:$regexp1)|(?$flags2:$regexp2)"
  3. If there are any back references, count capturing parenthesis and update back references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).


EDIT

I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.

  • The function is now split into two parts: There’s a separate function preg_split which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).

  • The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me – but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceeding backslash is actually escaped itself.

    Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.

  • I’ve fixed a bug in the sanity check.

  • I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.

By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.

Thanks!

porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.

I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.

Some Questions …

One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway.Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.

Also, I've changed the signature of your function to match my needs. I also thing that my version is more generally useful. Again, I might be wrong.

BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/

The Code

Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!

And without further ado …

/** * Merges several regular expressions into one, using the indicated 'glue'. * * This function takes care of individual modifiers so it's safe to use * <em>different</em> modifiers on the individual expressions. The order of * sub-matches is preserved as well. Numbered back-references are adapted to * the new overall sub-match count. This means that it's safe to use numbered * back-refences in the individual expressions! * If {@link $names} is given, the individual expressions are captured in * named sub-matches using the contents of that array as names. * Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently * <strong>not</strong> supported. * * The function assumes that all regular expressions are well-formed. * Behaviour is undefined if they aren't. * * This function was created after a {@link https://stackoverflow.com/questions/244959/ * StackOverflow discussion}. Much of it was written or thought of by * “porneL” and “eyelidlessness”. Many thanks to both of them. * * @param string $glue  A string to insert between the individual expressions. *      This should usually be either the empty string, indicating *      concatenation, or the pipe (<code>|</code>), indicating alternation. *      Notice that this string might have to be escaped since it is treated *      like a normal character in a regular expression (i.e. <code>/</code>) *      will end the expression and result in an invalid output. * @param array $expressions    The expressions to merge. The expressions may *      have arbitrary different delimiters and modifiers. * @param array $names  Optional. This is either an empty array or an array of *      strings of the same length as {@link $expressions}. In that case, *      the strings of this array are used to create named sub-matches for the *      expressions. * @return string An string representing a regular expression equivalent to the *      merged expressions. Returns <code>FALSE</code> if an error occurred. */function preg_merge($glue, array $expressions, array $names = array()) {    // … then, a miracle occurs.    // Sanity check …    $use_names = ($names !== null and count($names) !== 0);    if (        $use_names and count($names) !== count($expressions) or        !is_string($glue)    )        return false;    $result = array();    // For keeping track of the names for sub-matches.    $names_count = 0;    // For keeping track of *all* captures to re-adjust backreferences.    $capture_count = 0;    foreach ($expressions as $expression) {        if ($use_names)            $name = str_replace(' ', '_', $names[$names_count++]);        // Get delimiters and modifiers:        $stripped = preg_strip($expression);        if ($stripped === false)            return false;        list($sub_expr, $modifiers) = $stripped;        // Re-adjust backreferences:        // We assume that the expression is correct and therefore don't check        // for matching parentheses.        $number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);        if ($number_of_captures === false)            return false;        if ($number_of_captures > 0) {            // NB: This looks NP-hard. Consider replacing.            $backref_expr = '/                (                # Only match when not escaped:                    [^\\\\]      # guarantee an even number of backslashes                    (\\\\*?)\\2  # (twice n, preceded by something else).                )                \\\\ (\d)        # Backslash followed by a digit.            /x';            $sub_expr = preg_replace_callback(                $backref_expr,                create_function(                    '$m',                    'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'                ),                $sub_expr            );            $capture_count += $number_of_captures;        }        // Last, construct the new sub-match:        $modifiers = implode('', $modifiers);        $sub_modifiers = "(?$modifiers)";        if ($sub_modifiers === '(?)')            $sub_modifiers = '';        $sub_name = $use_names ? "?<$name>" : '?:';        $new_expr = "($sub_name$sub_modifiers$sub_expr)";        $result[] = $new_expr;    }    return '/' . implode($glue, $result) . '/';}/** * Strips a regular expression string off its delimiters and modifiers. * Additionally, normalize the delimiters (i.e. reformat the pattern so that * it could have used '/' as delimiter). * * @param string $expression The regular expression string to strip. * @return array An array whose first entry is the expression itself, the *      second an array of delimiters. If the argument is not a valid regular *      expression, returns <code>FALSE</code>. * */function preg_strip($expression) {    if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)        return false;    $delim = $matches[1];    $sub_expr = $matches[2];    if ($delim !== '/') {        // Replace occurrences by the escaped delimiter by its unescaped        // version and escape new delimiter.        $sub_expr = str_replace("\\$delim", $delim, $sub_expr);        $sub_expr = str_replace('/', '\\/', $sub_expr);    }    $modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));    return array($sub_expr, $modifiers);}

PS: I've made this posting community wiki editable. You know what this means …!