pandas dataframe convert column type to string or categorical


You need astype:

df['zipcode'] = df.zipcode.astype(str)
# df.zipcode = df.zipcode.astype(str)

For converting to categorical:

df['zipcode'] = df.zipcode.astype('category')
# df.zipcode = df.zipcode.astype('category')

Another solution is Categorical:

df['zipcode'] = pd.Categorical(df.zipcode)
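As a quick check that the two spellings above give the same result, here is a minimal sketch. The `zipcodes` Series and its values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample values, chosen only for this illustration.
zipcodes = pd.Series([98125, 98107, 98005, 98107], name="zipcode")

# Route 1: astype on the existing Series.
via_astype = zipcodes.astype("category")

# Route 2: build a Categorical and wrap it in a Series.
via_constructor = pd.Series(pd.Categorical(zipcodes), name="zipcode")

# Both routes infer the categories from the sorted unique values.
print(via_astype.equals(via_constructor))
print(via_astype.cat.categories.tolist())
print(via_astype.cat.codes.tolist())
```

Either form should be fine for a single column; `pd.Categorical` is mostly convenient when you want to pass `categories=` or `ordered=` at construction time.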

Sample with data:

import pandas as pd

df = pd.DataFrame({
    'zipcode': {17384: 98125, 2680: 98107, 722: 98005, 18754: 98109, 14554: 98155},
    'bathrooms': {17384: 1.5, 2680: 0.75, 722: 3.25, 18754: 1.0, 14554: 2.5},
    'sqft_lot': {17384: 1650, 2680: 3700, 722: 51836, 18754: 2640, 14554: 9603},
    'bedrooms': {17384: 2, 2680: 2, 722: 4, 18754: 2, 14554: 4},
    'sqft_living': {17384: 1430, 2680: 1440, 722: 4670, 18754: 1130, 14554: 3180},
    'floors': {17384: 3.0, 2680: 1.0, 722: 2.0, 18754: 1.0, 14554: 2.0},
})

print (df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot  zipcode
722         3.25         4     2.0         4670     51836    98005
2680        0.75         2     1.0         1440      3700    98107
14554       2.50         4     2.0         3180      9603    98155
17384       1.50         2     3.0         1430      1650    98125
18754       1.00         2     1.0         1130      2640    98109

print (df.dtypes)
bathrooms      float64
bedrooms         int64
floors         float64
sqft_living      int64
sqft_lot         int64
zipcode          int64
dtype: object

df['zipcode'] = df.zipcode.astype('category')

print (df)
       bathrooms  bedrooms  floors  sqft_living  sqft_lot zipcode
722         3.25         4     2.0         4670     51836   98005
2680        0.75         2     1.0         1440      3700   98107
14554       2.50         4     2.0         3180      9603   98155
17384       1.50         2     3.0         1430      1650   98125
18754       1.00         2     1.0         1130      2640   98109

print (df.dtypes)
bathrooms       float64
bedrooms          int64
floors          float64
sqft_living       int64
sqft_lot          int64
zipcode        category
dtype: object


With pandas >= 1.0 there is now a dedicated string datatype:

1) You can convert your column to this pandas string datatype using .astype('string'):

df['zipcode'] = df['zipcode'].astype('string')

2) This is different from using str which sets the pandas object datatype:

df['zipcode'] = df['zipcode'].astype(str)

3) For changing into categorical datatype use:

df['zipcode'] = df['zipcode'].astype('category')

You can see this difference in datatypes when you look at the info of the dataframe:

df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
    'zipcode_category': [90210, 90211],
})

df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_str'].astype('string')
df['zipcode_category'] = df['zipcode_category'].astype('category')

df.info()

# you can see that the first column has dtype object
# while the second column has the new dtype string
# the third column has dtype category
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   zipcode_str       2 non-null      object
 1   zipcode_string    2 non-null      string
 2   zipcode_category  2 non-null      category
dtypes: category(1), object(1), string(1)
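One practical place where the two conversions tend to differ is missing data. The sketch below uses a made-up column with one missing entry (`raw` is a hypothetical name, not from the question). The behaviour of `astype(str)` on the gap depends on the pandas version, so it is printed rather than asserted here.

```python
import pandas as pd

# Illustrative column with one missing zip code (made-up data).
raw = pd.Series(["90210", None, "90211"], dtype="object", name="zipcode")

as_str = raw.astype(str)          # conversion via Python's str
as_string = raw.astype("string")  # dedicated pandas string dtype

# The dedicated dtype keeps the gap as a real missing value.
print(as_string.isna().sum())
print(as_string.dtype)

# Compare with the str conversion. On pandas 1.x/2.x the missing entry
# is typically stringified instead of staying missing; inspect the
# output on your own version rather than relying on this comment.
print(as_str.tolist(), as_str.dtype)
```

If the column can contain missing values and you later call `isna()` or `dropna()` on it, `'string'` is usually the safer choice.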

From the docs:

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

  3. When reading code, the contents of an object dtype array is less clear than string.
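Point 2 of the quoted list can be sketched as follows. The frame `df_demo` and its column names are hypothetical and exist only to show how `select_dtypes` treats an object column with mixed content versus a `'string'` column.

```python
import pandas as pd

# Hypothetical frame: one int column and one object column that mixes
# a str and an int (the situation point 1 warns about).
df_demo = pd.DataFrame({
    "zipcode_int": [90210, 90211],
    "mixed_object": ["90210", 90211],
})

# Add a column with the dedicated string dtype (via str first, as in
# the answer's own example).
df_demo["zipcode_string"] = df_demo["zipcode_int"].astype(str).astype("string")

# Selecting on 'string' picks out only the real text column ...
string_cols = df_demo.select_dtypes(include="string").columns.tolist()

# ... while selecting on 'object' returns the mixed column as well,
# whether or not its contents are actually text.
object_cols = df_demo.select_dtypes(include="object").columns.tolist()

print(string_cols)
print(object_cols)
```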

More info on working with the new string datatype can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html


Prior answers focused on nominal (i.e. unordered) data. If there is a reason to impose an order, as with an ordinal variable, then one would use:

# Transform to category
df['zipcode_category'] = df['zipcode_category'].astype('category')

# Add ordered category
df['zipcode_ordered'] = df['zipcode_category']

# Set up the ordering. set_categories returns a new Series, so assign it back
# (the inplace argument was deprecated in pandas 1.3 and removed in 2.0).
df['zipcode_ordered'] = df['zipcode_ordered'].cat.set_categories(
    new_categories=[90211, 90210], ordered=True
)

# Output IDs
df['zipcode_ordered_id'] = df['zipcode_ordered'].cat.codes
print(df)
#  zipcode_category zipcode_ordered  zipcode_ordered_id
#            90210           90210                   1
#            90211           90211                   0
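An alternative way to get the same ordering in one step is to pass a `pd.CategoricalDtype` to `astype`. This is a minimal sketch, not the answer's original method; the `zips` Series and the `order` name are made up for illustration.

```python
import pandas as pd

# Declare the order explicitly: 90211 comes before 90210.
order = pd.CategoricalDtype(categories=[90211, 90210], ordered=True)

# Hypothetical values, converted straight to the ordered dtype.
zips = pd.Series([90210, 90211, 90210]).astype(order)

# Codes are positions in the declared category list.
print(zips.cat.codes.tolist())

# Sorting and comparisons follow the declared order, not numeric order.
print(zips.sort_values().tolist())
print((zips > 90211).tolist())
```

Because the order lives in the dtype, the same `order` object can be reused for other columns that should share it.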

More details on setting ordered categories can be found at the pandas website:

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#sorting-and-order