Why does this take so long to match? Is it a bug?

python regex performance state-machine

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.

The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.

For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.

This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

python regex performance state-machine

First off, I must say that it is not a BUG. its because of that , it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time. When it gets really bad, it’s called catastrophic backtracking.

This is the code of findall function in python source code:

 def findall(pattern, string, flags=0):    """Return a list of all non-overlapping matches in the string.    If one or more groups are present in the pattern, return a    list of groups; this will be a list of tuples if the pattern    has more than one group.    Empty matches are included in the result."""    return _compile(pattern, flags).findall(string)

As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!

The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alter native to try, and which to leave for later). Whichever course of action is attempted, if it’s successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).

Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:

At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None.So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!

As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).

Now for better understanding, I measure the runtime of the program while changing the length of the string:

'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵~/Desktop $ time python ex.pyreal    0m3.814suser    0m3.818ssys     0m0.000s'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before~/Desktop $ time python ex.pyreal    0m5.846suser    0m5.837ssys     0m0.015s'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before ~/Desktop $ time python ex.pyreal    0m15.796suser    0m15.803ssys     0m0.008s

So to avoid this problem you can use one of the below ways:

Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.

python regex performance state-machine

To avoid the catastrophic backtracking I suggest

r'\d+(,\d+)*/$'

CodeHunter

Why does this take so long to match? Is it a bug?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last