What could affect Python string comparison performance for strings over 64 characters?


Python can 'intern' short strings: it stores them in a special cache and reuses string objects from that cache.
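Here is a minimal sketch of what interning does, assuming CPython 3, where the function is exposed as sys.intern (in Python 2 it was the builtin intern). The variable names are made up for illustration, and the strings are built at runtime so that they are not interned automatically.

```python
import sys

# Build two equal strings at runtime so neither comes from the constant pool.
a = "".join(["x"] * 100)
b = "".join(["x"] * 100)

print(a == b)   # equal by value
print(a is b)   # typically False in CPython: two separate objects

# sys.intern returns the canonical object from the intern table.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # one shared object, so the pointer check succeeds
```

Once both names refer to the interned object, an equality test can stop at the pointer comparison and never look at the characters.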

When comparing strings, it first tests whether both operands are the same pointer (as they are for an interned string):

    if (a == b) {
        switch (op) {
        case Py_EQ:case Py_LE:case Py_GE:
            result = Py_True;
            goto out;
        // ...

Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
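One way to see the shortcut is to time a comparison of a string with itself against a comparison of two equal but distinct objects. This is only a rough sketch: the 1000-character length and the iteration count are arbitrary choices of mine, and the absolute numbers will vary by machine and Python version.

```python
import timeit

# Two equal 1000-character strings held in separate objects.
s1 = "".join(["a"] * 1000)
s2 = "".join(["a"] * 1000)

# Same object on both sides: the pointer check can answer immediately.
same_object = timeit.timeit("s1 == s1", globals={"s1": s1}, number=200_000)

# Equal contents, different objects: the characters have to be compared.
distinct = timeit.timeit("s1 == s2", globals={"s1": s1, "s2": s2}, number=200_000)

print(f"same object:      {same_object:.4f}s")
print(f"distinct objects: {distinct:.4f}s")
```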

However, interning normally takes place only for identifiers (function names, arguments, attributes, etc.), not for string values created at runtime.

Another possible culprit is string constants. String literals used in code are stored as constants at compile time and reused throughout, so again only one object is created, and identity tests on it are fast.
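A small sketch of the constant case, assuming CPython (the function name banner is hypothetical). The literal is compiled once into the function's constants, so every call hands back the same object, regardless of its length.

```python
def banner():
    # The literal is compiled once and stored in the function's constants.
    return "a literal that is comfortably longer than sixty-four characters in total length"

first = banner()
second = banner()

print(len(first))                           # well over 64 characters
print(first is second)                      # both calls return the stored constant
print(first in banner.__code__.co_consts)   # the literal lives in co_consts
```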

For string objects that are not the same, Python tests for equal length and equal first characters, then uses the memcmp() function on the internal C strings. If your strings are not interned and do not otherwise reuse the same objects, all other speed characteristics come down to the memcmp() function.
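The order of checks described above can be modelled in pure Python. This is an illustration only, not CPython's actual code: equal_sketch is a hypothetical name, it works on bytes, and the final loop merely stands in for memcmp().

```python
def equal_sketch(a: bytes, b: bytes) -> bool:
    """Model the order of checks for equality; not CPython's actual code."""
    if a is b:                  # 1. same object: nothing more to do
        return True
    if len(a) != len(b):        # 2. different lengths can never be equal
        return False
    if a and a[0] != b[0]:      # 3. cheap first-character check
        return False
    # 4. stand-in for memcmp(): compare the bytes one by one
    return all(x == y for x, y in zip(a, b))


print(equal_sketch(b"spam" * 20, b"spam" * 20))
print(equal_sketch(b"spam" * 20, b"spam" * 19 + b"spaX"))
```

Only the last step has a cost that grows with the string length, which is why the memcmp() implementation dominates for long, non-identical strings.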


I am just making wild guesses, but you asked "what might" rather than "what does", so here are some possibilities:

  • The CPU cache line size is 64 bytes and longer strings cause a cache miss.
  • Python might store strings of up to 64 bytes in one kind of structure and longer strings in a more complicated structure.
  • Related to the last one: it might zero-pad short strings into a 64-byte array, which would let it use very fast SSE2 vector instructions to compare two strings.
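Guesses like these can be checked by timing comparisons at lengths on either side of 64 and looking for a jump. The sketch below is one way to do that; the function name, the chosen lengths and the iteration count are my own assumptions, and the results depend on the interpreter build, the C library's memcmp() and the hardware.

```python
import timeit


def time_equal_compare(length, number=100_000):
    """Time == on two equal but distinct strings of the given length."""
    left = "".join(["a"] * length)
    right = "".join(["a"] * length)   # separate object, same contents
    return timeit.timeit("left == right",
                         globals={"left": left, "right": right},
                         number=number)


# Sample lengths just below, at, and above the 64-character boundary.
timings = {n: time_equal_compare(n) for n in (16, 32, 63, 64, 65, 128, 256)}
for n, seconds in timings.items():
    print(f"{n:>4} chars: {seconds:.4f}s")
```

A smooth increase with length would point at plain memcmp() cost, while a step around 64 would be consistent with a cache-line or storage-layout effect.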