Negative Lookahead Regex greed (why is .*? too greedy) Negative Lookahead Regex greed (why is .*? too greedy) python python

Negative Lookahead Regex greed (why is .*? too greedy)


#!/usr/bin/perlsub test_re {    $arg    = $_[0];    $INSULTSTR = $_[1];    $INSULTSTR =~ /(?:^|\.\s*)(?:(?![^.]*?$arg[^.]*\.))([^.]*\.)/;    if ($1) {        print "neg-lookahead($arg) MATCHED: '$1'\n";    } else {        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";    }}$INSULT = 'Yomama is ugly.  And, she smells like an wet dog.';test_re('Yomama', $INSULT);test_re('ugly', $INSULT);test_re('looks', $INSULT);test_re('And', $INSULT);test_re('And,', $INSULT);test_re('smells', $INSULT);test_re('dog', $INSULT);

Results:

neg-lookahead(Yomama) MATCHED: 'And, she smells like an wet dog.'neg-lookahead(ugly) MATCHED: 'And, she smells like an wet dog.'neg-lookahead(looks) MATCHED: 'Yomama is ugly.'neg-lookahead(And) MATCHED: 'Yomama is ugly.'neg-lookahead(And,) MATCHED: 'Yomama is ugly.'neg-lookahead(smells) MATCHED: 'Yomama is ugly.'neg-lookahead(dog) MATCHED: 'Yomama is ugly.'


If you're curious about what Perl is doing with a regex, you can run with the regex debugger:

perl -Dr -e '"A two. A one." =~ /(?![A-Z][^\.]*(?:two)[^\.]*\.)([A-Z][^\.]+\.)/; print ">$1<\n"'

which will generate much output for you to ponder. You will need a Perl built with -DDEBUGGING.


Your problem is that the regex engine will try as hard as possible to match (?![A-Z].*?$arg.*?\.), so with the "smells" case, it ends up matching the whole string. (The period in the middle is then included in one of the .*? constructs.) You should restrict the negative lookahead case to match only as much as the other case can:

Instead of:

(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)

Use:

(?:(?![A-Z][^.]*$arg[^.]*\.))([A-Z].*?\.)

Now, the negative lookahead cannot match more of the string than the other part can, since it must stop at the first period.