
Ignore case in Python strings [duplicate]


Here is a benchmark showing that using str.lower is faster than the accepted answer's proposed method (libc.strcasecmp):

    #!/usr/bin/env python2.7
    import random
    import timeit
    from ctypes import *

    libc = CDLL('libc.dylib')  # change to 'libc.so.6' on linux

    with open('/usr/share/dict/words', 'r') as wordlist:
        words = wordlist.read().splitlines()
    random.shuffle(words)
    print '%i words in list' % len(words)

    setup = 'from __main__ import words, libc; gc.enable()'
    stmts = [
        ('simple sort', 'sorted(words)'),
        ('sort with key=str.lower', 'sorted(words, key=str.lower)'),
        ('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
    ]

    for (comment, stmt) in stmts:
        t = timeit.Timer(stmt=stmt, setup=setup)
        print '%s: %.2f msec/pass' % (comment, (1000*t.timeit(10)/10))

Typical times on my machine:

    235886 words in list
    simple sort: 483.59 msec/pass
    sort with key=str.lower: 1064.70 msec/pass
    sort with cmp=libc.strcasecmp: 5487.86 msec/pass

So the version with str.lower is not only the fastest case-insensitive sort by far (about five times faster than libc.strcasecmp, and only about twice as slow as a plain case-sensitive sort), but also the most portable and Pythonic of all the solutions proposed here.

I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into libc doesn't duplicate any strings?
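One likely contributor to that gap, offered as a sketch rather than a measurement: a key function runs once per element, whereas a comparison function runs once per comparison, and a sort needs many more comparisons than it has elements. The snippet below assumes Python 3, where the `cmp=` argument no longer exists and `functools.cmp_to_key` stands in for it. The word list and the names `counting_key`, `counting_cmp` and `calls` are made up for illustration.

```python
import functools

# Small, made-up word list; the benchmark above used /usr/share/dict/words.
words = ["honeydew", "Grape", "banana", "fig", "apple", "Elderberry", "Cherry", "date"]

calls = {"key": 0, "cmp": 0}

def counting_key(s):
    # Called once per element; the lowered copy is cached for the whole sort.
    calls["key"] += 1
    return s.lower()

def counting_cmp(a, b):
    # Called once per comparison; lowers both strings every time.
    calls["cmp"] += 1
    x, y = a.lower(), b.lower()
    return (x > y) - (x < y)

by_key = sorted(words, key=counting_key)
by_cmp = sorted(words, key=functools.cmp_to_key(counting_cmp))

print(by_key == by_cmp)             # both orderings agree
print(calls["key"], calls["cmp"])   # key calls equal len(words); cmp calls are higher
```

The cmp path also pays for a Python-level function call on every comparison, which is probably where most of the extra time in the ctypes variant goes as well.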

NB: The lower() string method also has the advantage of being locale-dependent (for 8-bit strings in Python 2), which is something you will probably not get right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a Unicode context.
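For readers on Python 3 (an assumption; this answer was written against Python 2.7), `str.casefold()` is the method intended for caseless matching of Unicode text. A minimal sketch, where `equal_ignoring_case` is a hypothetical helper name and not part of any library:

```python
def equal_ignoring_case(a, b):
    # casefold() applies Unicode case folding, which is more aggressive
    # than lower(): for example, the German sharp s folds to "ss".
    return a.casefold() == b.casefold()

print(equal_ignoring_case("Straße", "STRASSE"))   # True
print("Straße".lower() == "STRASSE".lower())      # False
```

This still ignores Unicode normalization, so strings that differ only in composed versus decomposed accents may compare unequal unless they are normalized first.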


Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:

    Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
    [GCC 4.3.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_COLLATE, "en_US")
    'en_US'
    >>> sorted("ABCabc", key=locale.strxfrm)
    ['a', 'A', 'b', 'B', 'c', 'C']
    >>> sorted("ABCabc", cmp=locale.strcoll)
    ['a', 'A', 'b', 'B', 'c', 'C']

Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, because it compares the strings directly and avoids the "duplicate" strings that str.lower or locale.strxfrm create.
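The session above is Python 2. A rough Python 3 equivalent, assuming `functools.cmp_to_key` as the replacement for the removed `cmp=` argument, might look like the sketch below. It takes the collation locale from the environment instead of hard-coding "en_US", since that locale may not be installed everywhere; the resulting order depends on whichever locale ends up active, so no output is shown.

```python
import functools
import locale

try:
    # "" selects the collation locale from the environment (LANG, LC_ALL, ...).
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass  # keep whatever collation locale is already active

letters = "ABCabc"

# strxfrm builds a transformed copy of each string, once per element.
by_strxfrm = sorted(letters, key=locale.strxfrm)

# strcoll compares the original strings directly, once per comparison.
by_strcoll = sorted(letters, key=functools.cmp_to_key(locale.strcoll))

print(by_strxfrm)
print(by_strcoll)
```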


Are you using this comparison in a very frequently executed path of a highly performance-sensitive application? Alternatively, are you running it on strings that are megabytes in size? If not, then you shouldn't worry about the performance and should just use the .lower() method.

The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:

    from timeit import Timer

    s1 = "1234567890" * 100000 + "a"
    s2 = "1234567890" * 100000 + "B"
    code = "s1.lower() < s2.lower()"
    time = Timer(code, "from __main__ import s1, s2").timeit(1000)
    print time / 1000   # 0.00920499992371 on my machine

If this really is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive comparison. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html