
pandas distinction between str and object types


Numpy's string dtypes aren't python strings.

Therefore, pandas deliberately uses native python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

    In [1]: import numpy as np

    In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')

    In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a numpy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

    In [4]: x[1] = 'a really really really long'

    In [5]: x
    Out[5]:
    array(['Testing', 'a reall', 'string'],
          dtype='|S7')

While the object dtype versions can be arbitrary length:

    In [6]: y[1] = 'a really really really long'

    In [7]: y
    Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype, as well. I'll skip an example, for the moment.
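For anyone who wants to see the unicode point, here is a minimal sketch. It assumes Python 3 and a reasonably recent numpy, and the variable names (text, byte_arr, uni_arr) are mine, purely for illustration. It stores UTF-8 bytes in a narrow |S array, then stores the same text in a fixed-width U array:

```python
import numpy as np

text = 'na\u00efve'  # "naive" with a diaeresis on the i: 5 characters, 6 bytes in UTF-8
encoded = text.encode('utf-8')

# An 'S3' array keeps only the first 3 bytes of each element,
# which cuts the 2-byte character in half.
byte_arr = np.array([encoded], dtype='S3')
print(byte_arr[0])

# A 'U7' array stores up to 7 code points per element, 4 bytes each.
uni_arr = np.array([text, 'a', 'string'], dtype='U7')
print(uni_arr.dtype.itemsize)

# It is still fixed-width, so longer values are silently truncated.
uni_arr[1] = 'a really really really long'
print(uni_arr[1])
```

The bytes left in byte_arr are no longer valid UTF-8, which is roughly what "can't hold unicode properly" means in practice. The U dtype avoids that, but it keeps the fixed-width truncation behaviour.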

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

    In [8]: z = x.view(np.uint8)

    In [9]: z += 1

    In [10]: x
    Out[10]:
    array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
          dtype='|S7')

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
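As a rough illustration of the pandas side, the sketch below repeats the "long string" assignment on a Series. It assumes pandas is installed. The dtype that gets printed depends on the pandas version, so treat that line as informational only:

```python
import pandas as pd

s = pd.Series(['Testing', 'a', 'string'])

# Unlike the '|S7' numpy array, the Series does not truncate the new value.
s[1] = 'a really really really long'

print(s.dtype)  # the reported dtype varies between pandas versions
print(s[1])
```

The full 27-character string survives the assignment, which is the behaviour most users expect from a column of text.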