Is it worth using Python's re.compile?


I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all: only the time it takes to check the cache (a key lookup on an internal dict type).
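This caching behavior is easy to observe from the public API: the module-level helpers all go through the same internal cache, and re.purge() (a public function) clears it. A quick sketch:

```python
import re

# The module-level helpers (re.match, re.search, re.sub, ...) all go
# through the same internal pattern cache, so repeated calls with the
# same pattern string only pay the compilation cost once.
re.purge()                              # clear the internal cache
m1 = re.match('hello', 'hello world')   # compiled on first use, then cached
m2 = re.match('hello', 'hello world')   # cache hit: no recompilation
print(m1.group(0))
```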

From module re.py (comments are mine):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):
    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    # ...
    # Does actual compilation on cache miss
    # ...
    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.


For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though the two are fairly close, the last line of the second version feels more natural and simpler when used repeatedly.
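To make the comparison concrete, here is the second version filled in with the base-10 integer pattern from earlier (the variable name is just the one suggested above):

```python
import re

# The "integer in base 10 without leading zeros" pattern, bound to a
# readable, reusable name
num = re.compile(r"0|[1-9][0-9]*")

print(num.match("42").group(0))    # matches the whole number
print(num.match("007").group(0))   # the alternation stops at the leading zero
print(num.fullmatch("007"))        # use fullmatch to validate the entire string
```

Note that match() only anchors at the start of the string, so fullmatch() is the right tool when the whole input must be a valid integer.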


FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop
$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.
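The same comparison can also be scripted with the timeit module instead of the command line; a minimal sketch (the loop count and pattern are arbitrary choices):

```python
import timeit

# Time re.match with a pattern string vs. a precompiled pattern object.
# The first pays a cache lookup on every call; the second does not.
uncompiled = timeit.timeit("re.match('hello', 'hello world')",
                           setup="import re",
                           number=100_000)
compiled = timeit.timeit("h.match('hello world')",
                         setup="import re; h = re.compile('hello')",
                         number=100_000)
print(f"uncompiled: {uncompiled:.4f}s  compiled: {compiled:.4f}s")
```

On most machines the precompiled version comes out faster, though by a small absolute margin, which matches the shell timings above.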

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop
% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop
% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop
% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (note the quotation-mark differences between the last two runs) which shows that re.match(x, ...) is roughly equivalent in cost to calling re.compile(x).match(...) every time, i.e. no behind-the-scenes caching of the compiled representation seems to happen across the loop.