How to translate "bytes" objects into literal strings in pandas Dataframe, Python3.x?

Question:

I have a pandas DataFrame in Python 3.x in which certain columns are strings expressed as bytes (as in Python 2.x):

import pandas as pd
df = pd.DataFrame(...)
df
       COLUMN1         ....
0      b'abcde'        ....
1      b'dog'          ....
2      b'cat1'         ....
3      b'bird1'        ....
4      b'elephant1'    ....

When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object

However, if I access by element, it is a “bytes” object

df.COLUMN1.iloc[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'

How do I convert these into “regular” strings? That is, how can I get rid of this b'' prefix?

Asked By: ShanZhengYang

Answers:

You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")
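
Note that str.decode returns a new Series rather than modifying the column in place, so the result has to be assigned back. A minimal sketch, assuming a small made-up DataFrame shaped like the one in the question:

```python
import pandas as pd

# Hypothetical sample data mirroring the question's COLUMN1.
df = pd.DataFrame({"COLUMN1": [b"abcde", b"dog", b"cat1"]})

# str.decode returns a new Series; assign it back to replace the bytes column.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

print(df["COLUMN1"].tolist())  # ['abcde', 'dog', 'cat1']
```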

To do this for multiple columns, you can select just the object-dtype columns (this assumes they all hold bytes):

str_df = df.select_dtypes([object])

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]
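
Put together, the steps above might look like the following sketch. It is a variant that decodes column by column instead of using stack/unstack, and it assumes every object-dtype column contains only bytes; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two bytes columns and one numeric column.
df = pd.DataFrame({
    "name": [b"dog", b"cat1"],
    "kind": [b"mammal", b"mammal"],
    "legs": [4, 4],
})

# Object-dtype columns are the candidates; numeric columns are left alone.
bytes_cols = df.select_dtypes(include=[object]).columns

for col in bytes_cols:
    # Assumption: every value in these columns is bytes.
    df[col] = df[col].str.decode("utf-8")

print(df)
```

If some object columns hold ordinary str values, this sketch would turn them into NaN, so it only suits frames where the bytes assumption holds.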
Answered By: EdChum

df['COLUMN1'].apply(lambda x: x.decode("utf-8"))
Answered By: Yu Zhou

Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns; assumes every value in them is bytes.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))

I came across this thread while trying to solve the same problem, but more generally for a Series where some values may be of type str and others of type bytes. Drawing from the earlier solutions, I achieved this selective decoding as follows, resulting in a Series whose values are all of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn’t find one documented.

EDIT: One can definitely achieve the same in one line, but the ways I have thought of take about 25% longer (tested for Series of length 10^4 and 10^6), though they presumably do no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
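
Another way to express the same selective decoding is an explicit isinstance check applied element by element. This is only a sketch: decode_if_bytes is a hypothetical helper name, not a pandas function, and no timing comparison against the approaches above has been made.

```python
import pandas as pd

def decode_if_bytes(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes, pass every other value through unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value

# Mixed Series: bytes, str and a non-string value.
ser = pd.Series([b"value_1", "value_2", 165])

# Series.map calls the helper once per element.
decoded = ser.map(decode_if_bytes)

print(decoded.tolist())  # ['value_1', 'value_2', 165]
```

The explicit type check keeps non-string values such as integers untouched instead of relying on NaN as a marker for "not decoded".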
Answered By: Carl Smith

I had an issue with some columns in a DataFrame being either full of str or a mix of str and bytes. I solved it with a minor modification of the solution provided by @Christabella Irwanto (I’m more of a fan of str.decode('utf-8'), as suggested by @Mad Physicist):

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # Decode; fall back to the original value where decode returns NaN.
        df[col] = df[col].str.decode('utf-8').fillna(df[col])


>>> df[col]
0        Element
1     b'Element'
2         b'165'
3            165
4             25
5             25

>>> df[col].str.decode('utf-8').fillna(df[col])
0     Element
1     Element
2         165
3         165
4          25
5          25
Answered By: GentilsTo