How to translate "bytes" objects into literal strings in pandas Dataframe, Python3.x?


You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")
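Note that str.decode returns a new Series rather than modifying the column in place, so the result has to be assigned back. A minimal sketch, assuming a made-up frame in which COLUMN1 holds UTF-8 encoded bytes (the data and the extra column "n" are only for illustration):

```python
import pandas as pd

# Hypothetical frame: COLUMN1 holds UTF-8 encoded bytes, "n" is an unrelated column.
df = pd.DataFrame({"COLUMN1": [b"abc", b"def"], "n": [1, 2]})

# str.decode returns a new Series, so assign it back to keep the result.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

decoded = df["COLUMN1"].tolist()
```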

To do this for multiple columns you can select just the str columns:

str_df = df.select_dtypes([object])  # np.object was removed in NumPy 1.24; the builtin object works everywhere

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]


Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object-dtype columns (assumed to hold bytes).
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))
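One caveat with the loop above: an object column can also hold ordinary str values, and calling .decode on those raises AttributeError. A possible guarded variant is sketched below. The helper name decode_bytes_columns and the sample frame are hypothetical, not part of pandas or of the answers above.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df in which bytes values in object columns are decoded.

    Hypothetical helper for illustration; values that are not bytes are left as-is.
    """
    out = df.copy()
    for col in out.select_dtypes(include=[object]).columns:
        out[col] = out[col].map(
            lambda x: x.decode(encoding) if isinstance(x, bytes) else x
        )
    return out


# Made-up data: one bytes column, one str column, one numeric column.
df = pd.DataFrame({"a": [b"x", b"y"], "b": ["p", "q"], "c": [1, 2]})
result = decode_bytes_columns(df)
```

The isinstance check makes this slower than the vectorised str.decode, but it lets bytes and str columns share the same code path and leaves the original frame untouched.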


I came across this thread while trying to solve the same problem, but more generally for a Series where some values may be of type str and others of type bytes. Drawing on the earlier solutions, I achieved this selective decoding as follows, resulting in a Series whose values are all of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn't find one documented.

EDIT: One can definitely achieve the same in one line, but the ways I have thought of to do so take about 25% longer (tested for Series of length 10^4 and 10^6), though they presumably do no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
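Another single-expression option, offered only as a sketch and not benchmarked against the timings above, is an element-wise map with an isinstance check. It returns a new Series instead of modifying ser in place:

```python
import pandas as pd

# Mixed Series: one bytes value, one str value.
ser = pd.Series([b"value_1", "value_2"])

# Decode only the bytes elements; str elements pass through unchanged.
decoded = ser.map(lambda x: x.decode("utf-8") if isinstance(x, bytes) else x)
```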