Negative Lookahead Regex greed (why is .*? too greedy)
#!/usr/bin/perlsub test_re { $arg = $_[0]; $INSULTSTR = $_[1]; $INSULTSTR =~ /(?:^|\.\s*)(?:(?![^.]*?$arg[^.]*\.))([^.]*\.)/; if ($1) { print "neg-lookahead($arg) MATCHED: '$1'\n"; } else { print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n"; }}$INSULT = 'Yomama is ugly. And, she smells like an wet dog.';test_re('Yomama', $INSULT);test_re('ugly', $INSULT);test_re('looks', $INSULT);test_re('And', $INSULT);test_re('And,', $INSULT);test_re('smells', $INSULT);test_re('dog', $INSULT);
Results:
neg-lookahead(Yomama) MATCHED: 'And, she smells like an wet dog.'neg-lookahead(ugly) MATCHED: 'And, she smells like an wet dog.'neg-lookahead(looks) MATCHED: 'Yomama is ugly.'neg-lookahead(And) MATCHED: 'Yomama is ugly.'neg-lookahead(And,) MATCHED: 'Yomama is ugly.'neg-lookahead(smells) MATCHED: 'Yomama is ugly.'neg-lookahead(dog) MATCHED: 'Yomama is ugly.'
If you're curious about what Perl is doing with a regex, you can run with the regex debugger:
perl -Dr -e '"A two. A one." =~ /(?![A-Z][^\.]*(?:two)[^\.]*\.)([A-Z][^\.]+\.)/; print ">$1<\n"'
which will generate much output for you to ponder. You will need a Perl built with -DDEBUGGING.
Your problem is that the regex engine will try as hard as possible to match (?![A-Z].*?$arg.*?\.)
, so with the "smells" case, it ends up matching the whole string. (The period in the middle is then included in one of the .*?
constructs.) You should restrict the negative lookahead case to match only as much as the other case can:
Instead of:
(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)
Use:
(?:(?![A-Z][^.]*$arg[^.]*\.))([A-Z].*?\.)
Now, the negative lookahead cannot match more of the string than the other part can, since it must stop at the first period.