How to translate "bytes" objects into literal strings in pandas Dataframe, Python3.x?


You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")
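As a minimal sketch of the call above: str.decode returns a new Series rather than modifying the column in place, so the result has to be assigned back. The frame contents and the second column name here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one column of bytes, one numeric column.
df = pd.DataFrame({"COLUMN1": [b"alpha", b"beta"], "COLUMN2": [1, 2]})

# str.decode returns a new Series, so assign it back to keep the result.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

print(df["COLUMN1"].tolist())
```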

To do this for multiple columns you can select just the str columns:

str_df = df.select_dtypes([object])  # np.object was removed in NumPy 1.24; the builtin object works

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]
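Put together, the steps above might look like the following sketch. The column names and values are invented for illustration, and it assumes every object-dtype column holds only bytes (str.decode yields NaN for values that are not bytes).

```python
import pandas as pd

# Hypothetical sample data: two bytes columns and one integer column.
df = pd.DataFrame({
    "name": [b"ann", b"bob"],
    "city": [b"Oslo", b"Rome"],
    "age": [31, 42],
})

# Select only the object-dtype columns (where the bytes values live).
str_df = df.select_dtypes([object])

# Stack into one long Series, decode everything at once, then restore the shape.
str_df = str_df.stack().str.decode("utf-8").unstack()

# Write the decoded columns back into the original frame by name.
for col in str_df:
    df[col] = str_df[col]

print(df)
```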


Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object-dtype columns (assumed to hold bytes).
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))


I came across this thread while trying to solve the same problem, but more generally for a Series where some values may be of type str and others of type bytes. Drawing on the earlier solutions, I achieved this selective decoding as follows, resulting in a Series whose values are all of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn't find one documented.

EDIT: One can definitely achieve the same in one line, but the ways I have thought of to do so take about 25% longer (tested for Series of length 10^4 and 10^6), though they presumably do no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
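Another option for the mixed str/bytes case, offered only as a sketch and not benchmarked against the approaches above, is an element-wise map that decodes bytes and passes every other value through unchanged. The variable name decoded is just for illustration.

```python
import pandas as pd

# Same mixed Series as above: one bytes value, one str value.
ser = pd.Series(["value_1".encode("utf-8"), "value_2"])

# Decode only the bytes values; anything else is returned as-is.
decoded = ser.map(lambda v: v.decode("utf-8") if isinstance(v, bytes) else v)

print(decoded.tolist())
```

This returns a new Series and leaves ser untouched, so assign it back if the original should be replaced.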