Python: pandas dataframe: Remove "ï»¿" BOM character

Question:

I used Scrapy on a Linux machine to crawl some websites and saved the results to a CSV file. When I open the dataset on a Windows machine, I see these characters: ï»¿. Here is what I do to re-encode the file as UTF-8-SIG:

import pandas as pd

my_data = pd.read_csv("./dataset/my_data.csv")
output = "./dataset/my_data_converted.csv"
my_data.to_csv(output, encoding='utf-8-sig', index=False)

So now they become ? when viewed in VS Code. But if I open the file in Notepad++, I don’t see them. How do I actually remove them all?

Asked By: hydradon

Answers:

Given your comment, I suppose that you ended up having two BOMs.

Let’s look at a small example.
I’m using built-in open instead of pd.read_csv/pd.to_csv, but the meaning of the encoding parameter is the same.

Let’s create a file saved as UTF-8 with a BOM:

>>> text = 'foo'
>>> with open('/tmp/foo', 'w', encoding='utf-8-sig') as f:
...     f.write(text)

Now let’s read it back in.
But we use a different encoding: “utf-8” instead of “utf-8-sig”.
In your case, you didn’t specify the encoding parameter at all, but the default value is most probably “utf-8” or “cp1252”, neither of which strips the BOM.
So the following is more or less equivalent to your code snippet:

>>> with open('/tmp/foo', 'r', encoding='utf8') as f:
...     text = f.read()
... 
>>> text
'\ufefffoo'
>>> with open('/tmp/foo_converted', 'w', encoding='utf-8-sig') as f:
...     f.write(text)

The BOM is read as part of the text; it’s the first character (here represented as "\ufeff").

Let’s see what’s actually in the files, using a suitable command-line tool:

$ hexdump -C /tmp/foo
00000000  ef bb bf 66 6f 6f                                 |...foo|
00000006
$ hexdump -C /tmp/foo_converted 
00000000  ef bb bf ef bb bf 66 6f  6f                       |......foo|
00000009

In UTF-8, the BOM is encoded as the three bytes EF BB BF.
Clearly, the second file has two of them.
So even a BOM-aware program will find a nonsense character at the beginning of foo_converted, because the BOM is only stripped once.
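One way to act on this, sketched with the same built-in open: read with “utf-8-sig” so that one BOM is stripped, drop any further leading U+FEFF characters, and write back as plain “utf-8” so no new BOM is added. The temporary paths and the helper names read_without_bom and strip_all_leading_boms are made up for this illustration; since the encoding parameter means the same thing for pd.read_csv and to_csv, the same idea should carry over there.

```python
import codecs
import os
import tempfile


def read_without_bom(path):
    """Read a UTF-8 text file, dropping one leading BOM if present."""
    # 'utf-8-sig' strips a single leading BOM on read; plain 'utf-8' keeps it.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def strip_all_leading_boms(text):
    """Remove every leading U+FEFF, e.g. from a file that ended up with two BOMs."""
    return text.lstrip("\ufeff")


# Hypothetical scratch files, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "foo_converted")
dst = os.path.join(tmp_dir, "foo_clean")

# Reproduce the damaged file: two BOMs followed by the payload.
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF8 * 2 + b"foo")

text = strip_all_leading_boms(read_without_bom(src))

# Write back as plain UTF-8 so that no BOM is prepended.
with open(dst, "w", encoding="utf-8") as f:
    f.write(text)

with open(dst, "rb") as f:
    cleaned_bytes = f.read()

print(repr(text))     # 'foo'
print(cleaned_bytes)  # b'foo'
```

If the file is meant for a Windows program that relies on the BOM to detect UTF-8, writing with “utf-8-sig” instead is reasonable, as long as the text being written no longer starts with U+FEFF.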

Answered By: lenz

For me, the BOM was prepended to the first column name. Fortunately, pandas was still able to read the file into a dataframe, with the BOM left on the first column name. I iterate over ALL columns to remove the BOM from the first column name (since I deal with many different CSV sources, I can’t be sure what the first column is called):

import re

# Remove the byte order mark (and any other character outside the whitelist) from each column name.
for column in df.columns:
    new_column_name = re.sub(r"[^0-9a-zA-Z.,/_ -]", "", column)
    df.rename(columns={column: new_column_name}, inplace=True)

Hope this helps someone..
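A narrower variant, as a rough sketch: the whitelist regex also deletes legitimate characters such as parentheses or accented letters, so if only the BOM is unwanted it may be enough to strip a leading U+FEFF. This assumes the file was decoded as UTF-8, so the BOM shows up as the single character "\ufeff"; the helper name strip_bom and the sample column names are invented for the example.

```python
def strip_bom(name):
    """Drop U+FEFF from the start of a column name and leave everything else alone."""
    return name.lstrip("\ufeff")


# Hypothetical header as it might come back from a CSV that starts with a BOM.
columns = ["\ufeffid", "full name", "price (€)"]
cleaned = [strip_bom(c) for c in columns]

print(cleaned)  # ['id', 'full name', 'price (€)']
```

With a dataframe, the same function can be passed directly to rename, which accepts a callable: df = df.rename(columns=strip_bom).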

Answered By: Walter Kelt