Inconsistency between sed and python regular expressions Inconsistency between sed and python regular expressions python python

Inconsistency between sed and python regular expressions


Both Python and sed are greedy by default but...Python regex tries to evaluate from left to right in all circumstances, despite of it must do eventually a backtrace to the previous state if the branch being tried can not continue by matching.Sed regex on the contrary are optimized before evaluating in order to prevent an unnecessary backtrace, by rewriting the regex to a more deterministic form. Therefore the combined optional pattern "aab" is probably tested before the plain "a" because the most specific possible string is tried first.

Python pattern matches the string "aabb" twice as "aab" + "b" (marked between "<>")

>>> re.sub("a*((ab)*)b", r"<\1>", "aabb")'<><>'

while sed matches the whole "aabb" by one substitution:

$ echo "aabb" | sed "s/a*\(\(ab\)*\)b/<\1>/"<ab>

Python regex backtrace algorithm is explained good in regex howto - Repeating Things in two paragraphs introduced by words "A step-by-step example...". It does IMO exactly what is described regex docs: "As the target string is scanned, REs separated by '|' are tried from left to right."

Demonstration

The order of "(|a|aa)" btw. "(aa|a|)" is respected by Python

>>> re.sub("(?:|a|aa)((ab)*)b", r"<\1>", "aabb")'<ab>'>>> re.sub("(?:aa|a|)((ab)*)b", r"<\1>", "aabb")'<><>'

but this order is ignored by sed because sed optimizes regular expressions. Matching "aab" + "b" can be reproduced removing "a" option from the pattern.

$ echo "aabb" | sed "s/\(\|a\|aa\)\(\(ab\)*\)b/<\2>/g"<ab>$ echo "aabb" | sed "s/\(aa\|a\|\)\(\(ab\)*\)b/<\2>/g"<ab>$ echo "aabb" | sed "s/\(aa\|\)\(\(ab\)*\)b/<\2>/g"<><>

Edit: I removed everything about DFA/NFA because I can not prove it from current texts.


Interesting puzzle you've constructed. From what I've read, the regexp engines of both python and sed are based on Henry Spencer's regex library (as is perl's), which relies on backtracking. (Unfortunately I can't find the article I'm basing this on).

Anyway, this is not something that's supposed to be an implementation detail: Python's behavior goes against the POSIX standard, which requires REs to (a) match at the earliest possible point, and (b) match the longest possible string that starts at that point. (See man 7 regex (on Linux) for this and a whole lot more.)

To find the longest match, a backtracking ("NFA-type") regex engine must continue examining alternatives after it finds one match. So it's not surprising that the implementers cut corners. Obviously, python's behavior is non-conforming since it fails to find the longest match. According to the sed manual page, sed doesn't always conform either, "for performance reasons". But obviously it gets this case right.

Incidentally, your commands are not fully equivalent: re.sub will perform a substitution as many times as possible, while sed's s/a/b/ will only perform it once.The sed version should have been:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/g"

This explains why we get the empty string in python: The RE matches aab the first time and the remaining b the second time, removing each part (since it's all matched by a* and the final b of the regexp). You can see this by the following variant:

>>> re.sub("a*((ab)*)b", r"X\1Y", "aabb")'XYXY'