What is dtype('O'), in pandas?
It means:
'O' (Python) objects
The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:
- 'b' boolean
- 'i' (signed) integer
- 'u' unsigned integer
- 'f' floating-point
- 'c' complex-floating point
- 'O' (Python) objects
- 'S', 'a' (byte-)string
- 'U' Unicode
- 'V' raw data (void)
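As a quick check of the kind codes listed above, the sketch below builds a few dtypes from type strings and prints their kind and item size. The particular type strings are chosen only for illustration. Note that boolean is written as 'b1' here, because a bare 'b' type string is read by NumPy as a signed byte.

```python
import numpy as np

# Each type string is a kind character followed by an item size.
# 'b1' is used for boolean: a bare 'b' would be parsed as int8.
codes = ["b1", "i4", "u2", "f8", "c16", "O", "S5", "U5", "V8"]

for code in codes:
    dt = np.dtype(code)
    print(f"{code:>4}  kind={dt.kind!r}  itemsize={dt.itemsize}  name={dt.name}")

# For Unicode the number counts characters, not bytes:
# 'U5' holds 5 characters of 4 bytes each.
unicode_itemsize = np.dtype("U5").itemsize
object_kind = np.dtype("O").kind
```

The item size printed for 'O' is the size of a pointer, since an object array stores references to Python objects rather than the objects themselves.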
Another answer helps if you need to check types.
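One possible way to check types is with the helper functions in pandas.api.types. The sketch below is only an illustration: the column names are made up, and the text column is built with an explicit dtype=object so that it does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_numeric_dtype,
    is_object_dtype,
)

# Hypothetical example frame; "text" is forced to the object dtype.
df = pd.DataFrame({
    "num": [1.0, 2.0],
    "when": pd.to_datetime(["2018-03-10", "2018-03-11"]),
    "text": pd.Series(["foo", "bar"], dtype=object),
})

# For every column record (is numeric, is datetime, is object).
checks = {
    col: (
        is_numeric_dtype(df[col]),
        is_datetime64_any_dtype(df[col]),
        is_object_dtype(df[col]),
    )
    for col in df.columns
}

for col, flags in checks.items():
    print(col, df[col].dtype, flags)
```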
When you see dtype('O') inside a dataframe, it usually means a pandas string column, which pandas stores as generic Python objects.
What is dtype? Does it belong to pandas, to numpy, to both, or to something else? Let's examine some pandas code:
    import pandas as pd

    df = pd.DataFrame({'float': [1.0],
                       'int': [1],
                       'datetime': [pd.Timestamp('20180310')],
                       'string': ['foo']})
    print(df)
    print(df['float'].dtype, df['int'].dtype, df['datetime'].dtype, df['string'].dtype)
    df['string'].dtype
It will output like this:
       float  int   datetime string
    0    1.0    1 2018-03-10    foo
    ---
    float64 int64 datetime64[ns] object
    ---
    dtype('O')
You can interpret the last one as the pandas dtype('O'), or pandas object, which holds the Python type str and corresponds to the NumPy string_ or unicode_ types.
| Pandas dtype | Python type | NumPy type        | Usage |
|--------------|-------------|-------------------|-------|
| object       | str         | string_, unicode_ | Text  |
Like Don Quixote on his ass, pandas sits on NumPy. NumPy understands the underlying architecture of your system and uses the class numpy.dtype for that.
A data type object is an instance of the numpy.dtype class that describes the data type more precisely, including:
- Type of the data (integer, float, Python object, etc.)
- Size of the data (how many bytes are in e.g. the integer)
- Byte order of the data (little-endian or big-endian)
- If the data type is structured, an aggregate of other data types (e.g., describing an array item consisting of an integer and a float):
  - What are the names of the "fields" of the structure
  - What is the data-type of each field
  - Which part of the memory block each field takes
- If the data type is a sub-array, what is its shape and data type
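The properties in the list above can be read directly from a numpy.dtype instance. The following is a minimal sketch; the field names count and value are invented for the example, and the structured dtype uses NumPy's default packed layout.

```python
import numpy as np

# Type and size of the data.
simple = np.dtype("i4")
print(simple.kind, simple.itemsize)

# Byte order: an explicitly big-endian integer. On a big-endian machine
# NumPy may report this as native ('=') instead of '>'.
big_endian = np.dtype(">i4")
print(big_endian.byteorder)

# Structured dtype: an integer and a float (hypothetical field names).
record = np.dtype([("count", "i4"), ("value", "f8")])
print(record.names)
for name in record.names:
    field_dtype, offset = record.fields[name]
    print(name, field_dtype, "starts at byte", offset)

# Sub-array dtype: a 2x3 block of 4-byte integers.
sub = np.dtype(("i4", (2, 3)))
print(sub.shape, sub.base, sub.itemsize)
```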
In the context of this question, dtype belongs to both pandas and numpy, and in particular dtype('O') means we expect a string.
Here is some code for testing, with explanation. Suppose we have the dataset as a dictionary:
    import pandas as pd
    import numpy as np
    from pandas import Timestamp

    data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
            'date': {0: Timestamp('2018-12-12 00:00:00'),
                     1: Timestamp('2018-12-12 00:00:00'),
                     2: Timestamp('2018-12-12 00:00:00'),
                     3: Timestamp('2018-12-12 00:00:00'),
                     4: Timestamp('2018-12-12 00:00:00')},
            'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development',
                     3: 'Sales', 4: 'Engineering'},
            'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
            'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
    df = pd.DataFrame.from_dict(data)  # now we have a dataframe
    print(df)
    print(df.dtypes)
The last two lines print the dataframe and its dtypes. Note the output:
       id       date                  role  num   fnum
    0   1 2018-12-12               Support  123   3.14
    1   2 2018-12-12             Marketing  234   2.14
    2   3 2018-12-12  Business Development  345  -0.14
    3   4 2018-12-12                 Sales  456  41.30
    4   5 2018-12-12           Engineering  567   3.14
    id               int64
    date    datetime64[ns]
    role            object
    num              int64
    fnum           float64
    dtype: object
All kinds of different dtypes.
    df.iloc[1,:] = np.nan
    df.iloc[2,:] = None
But if we set rows to np.nan or None, the datetime64[ns] and object columns keep their original dtype, while the integer columns are upcast to float64 so that they can hold NaN. The output will be like this:
    print(df)
    print(df.dtypes)

        id       date         role    num   fnum
    0  1.0 2018-12-12      Support  123.0   3.14
    1  NaN        NaT          NaN    NaN    NaN
    2  NaN        NaT         None    NaN    NaN
    3  4.0 2018-12-12        Sales  456.0  41.30
    4  5.0 2018-12-12  Engineering  567.0   3.14
    id             float64
    date    datetime64[ns]
    role            object
    num            float64
    fnum           float64
    dtype: object
So np.nan or None will not change a column's dtype (apart from upcasting integer columns to float64, as the output above shows), unless we set all of the column's rows to np.nan or None. In that case the column will become float64 or object, respectively.
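The effect of missing values on the dtype can also be seen when building a Series directly. This is a small sketch with made-up values, not the dataframe from above; the text Series is given an explicit dtype=object so the result does not depend on how a particular pandas version infers strings.

```python
import numpy as np
import pandas as pd

# Integers mixed with a missing value are stored as float64,
# because NaN is a floating-point value.
ints_with_nan = pd.Series([1, np.nan, 3])
ints_with_none = pd.Series([1, None, 3])

# An object column simply keeps the missing value as one more object.
text_with_nan = pd.Series(["a", np.nan, "c"], dtype=object)

# Only missing values: nothing but NaN gives float64.
all_nan = pd.Series([np.nan, np.nan])

for name, s in [("ints_with_nan", ints_with_nan),
                ("ints_with_none", ints_with_none),
                ("text_with_nan", text_with_nan),
                ("all_nan", all_nan)]:
    print(name, s.dtype)
```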
You may also try setting single rows:
    df.iloc[3,:] = 0   # will convert datetime to object only
    df.iloc[4,:] = ''  # will convert all columns to object
Note that if we set a string inside a non-string column, the column will become the string or object dtype.
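To see what an object column actually holds after numbers and strings are mixed, one option is to look at the Python type of each element. The values below are invented for illustration.

```python
import pandas as pd

# Numbers and a string in one Series: no single numeric dtype fits,
# so pandas falls back to the generic object dtype.
mixed = pd.Series([1, 2, "three"])
print(mixed.dtype)

# Each element keeps its own Python type inside the object column.
element_types = mixed.map(type).tolist()
print(element_types)
```

So dtype('O') only says that the column stores Python objects; the individual values may still be integers, strings, or anything else.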