Why does the size of this Python String change on a failed int conversion

python-3.x


The code that converts strings to ints in CPython 3.6 requests a UTF-8 form of the string to work with:

buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);

and PyUnicode_AsUTF8AndSize creates the UTF-8 representation the first time it's requested and caches it on the string object:

if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
           PyBytes_AS_STRING(bytes),
           _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}

The extra 3 bytes are that cached UTF-8 representation: for 'ñ', the 2 bytes of its UTF-8 encoding plus a terminating NUL, all of which sys.getsizeof counts.
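You can watch this happen from Python. A quick sketch (CPython-specific; the exact byte counts depend on your build, but the 3-byte jump after the failed conversion is the point):

import sys

s = 'ñ'
print(sys.getsizeof(s))          # e.g. 74 on a 64-bit CPython 3.6 build

try:
    int(s)                       # fails, but requests (and caches) a UTF-8 form first
except ValueError:
    pass

print(sys.getsizeof(s))          # e.g. 77: 2 UTF-8 bytes for 'ñ' plus a trailing NUL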


You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string uses the "compact ASCII" representation, Python doesn't create a separate UTF-8 representation; it returns the ASCII data directly, which is already valid UTF-8:

#define PyUnicode_UTF8(op)                              \
    (assert(_PyUnicode_CHECK(op)),                      \
     assert(PyUnicode_IS_READY(op)),                    \
     PyUnicode_IS_COMPACT_ASCII(op) ?                   \
         ((char*)((PyASCIIObject*)(op) + 1)) :          \
         _PyUnicode_UTF8(op))
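A quick way to check (again a CPython-specific sketch): a pure-ASCII string's reported size stays put after a failed conversion, because no separate buffer is ever allocated:

import sys

s = 'plain ascii text'
before = sys.getsizeof(s)

try:
    int(s)                       # fails; the ASCII data itself is handed out as UTF-8
except ValueError:
    pass

print(sys.getsizeof(s) == before)    # True: no separate UTF-8 buffer was allocated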

You also might wonder why the size doesn't change for something like '１' (U+FF11 FULLWIDTH DIGIT ONE), which int treats as equivalent to '1'. That's because one of the earlier steps in the string-to-int process is

asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);

which converts all whitespace characters to ' ' and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it doesn't end up changing anything, but when it does make changes, it creates a new string, and it's that new string which gets a UTF-8 representation created; the string you passed to int is left alone, so its size doesn't change.
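So for the fullwidth digit the conversion actually succeeds, but any UTF-8 work happens on the transformed copy, and the string you're holding never grows. A sketch of what you'd see on CPython:

import sys

s = '\uff11'                     # '１', U+FF11 FULLWIDTH DIGIT ONE
before = sys.getsizeof(s)

print(int(s))                    # 1 -- int accepts Unicode decimal digits
print(sys.getsizeof(s) == before)    # True: the UTF-8 cache ended up on the new,
                                     # ASCII-transformed string, not on s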


As for the cases where calling int on one string looks like it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because CPython keeps a cache of single-character strings in the Latin-1 range ('\x00'-'\xff') and hands out the same object every time one of those characters is needed.
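That's easy to see directly (CPython-specific, of course): two independently written 'ñ' literals turn out to be one object, so state cached through one is visible through the other:

import sys

a = 'ñ'
b = 'ñ'
print(a is b)                    # True: one-character Latin-1 strings are cached and reused

try:
    int(a)                       # caches a UTF-8 buffer on that shared object
except ValueError:
    pass

print(sys.getsizeof(b))          # shows the extra 3 bytes even though we only "touched" a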