Python: pandas dataframe: Remove "ï»¿" BOM character
Question:
I used Scrapy on a Linux machine to crawl some websites and saved the results in a CSV. When I retrieve the dataset and view it on a Windows machine, I see these characters: ï»¿. Here is what I do to re-encode the file as UTF-8-SIG:
import pandas as pd
my_data = pd.read_csv("./dataset/my_data.csv")
output = "./dataset/my_data_converted.csv"
my_data.to_csv(output, encoding='utf-8-sig', index=False)
So now they become "?" when viewed in VSCode. But if I view the file in Notepad++, I don't see them at all. How do I actually remove them?
Answers:
Given your comment, I suppose that you ended up having two BOMs. Let's look at a small example. I'm using the built-in open instead of pd.read_csv/pd.to_csv, but the meaning of the encoding parameter is the same.
Let’s create a file saved as UTF-8 with a BOM:
>>> text = 'foo'
>>> with open('/tmp/foo', 'w', encoding='utf-8-sig') as f:
...     f.write(text)
Now let's read it back in, but with a different encoding: "utf-8" instead of "utf-8-sig". In your case, you didn't specify the encoding parameter at all, but the default value is most probably "utf-8" or "cp1252", both of which keep the BOM. So the following is more or less equivalent to your code snippet:
>>> with open('/tmp/foo', 'r', encoding='utf8') as f:
...     text = f.read()
...
>>> text
'\ufefffoo'
>>> with open('/tmp/foo_converted', 'w', encoding='utf-8-sig') as f:
...     f.write(text)

The BOM is read as part of the text; it's the first character (here represented as "\ufeff").
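Reading with "utf-8-sig" instead would have stripped the single BOM cleanly. A minimal round-trip sketch (using a temporary file rather than /tmp/foo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo")

# Write with a BOM, as in the example above.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("foo")

# Plain utf-8 keeps the BOM as a leading '\ufeff' character:
with open(path, encoding="utf-8") as f:
    assert f.read() == "\ufefffoo"

# utf-8-sig strips a single leading BOM on input:
with open(path, encoding="utf-8-sig") as f:
    assert f.read() == "foo"
```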
Let’s see what’s actually in the files, using a suitable command-line tool:
$ hexdump -C /tmp/foo
00000000 ef bb bf 66 6f 6f |...foo|
00000006
$ hexdump -C /tmp/foo_converted
00000000 ef bb bf ef bb bf 66 6f 6f |......foo|
00000009
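If hexdump isn't available, the same double-BOM effect can be reproduced and inspected directly in Python: "utf-8-sig" prepends one BOM at encode time, and the text's own leading '\ufeff' becomes a second one.

```python
# 'utf-8-sig' adds one BOM on encoding; the string's own '\ufeff' adds another.
data = "\ufefffoo".encode("utf-8-sig")
assert data == b"\xef\xbb\xbf\xef\xbb\xbffoo"  # two BOMs, then 'foo'
```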
In UTF-8, the BOM is encoded as the three bytes EF BB BF. Clearly, the second file has two of them. So even a BOM-aware program will find a nonsense character at the beginning of foo_converted, as the BOM is only stripped once.
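A minimal sketch of the fix in pandas, then: pass encoding='utf-8-sig' when *reading*, so the BOM is stripped on input rather than carried through (the CSV bytes below are a made-up stand-in for my_data.csv):

```python
import io
import os
import tempfile

import pandas as pd

# Simulate a CSV file that starts with a UTF-8 BOM:
raw = "\ufeffname,score\nalice,1\n".encode("utf-8")

# Reading with utf-8-sig strips the BOM before pandas sees the header.
my_data = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")
assert list(my_data.columns) == ["name", "score"]

# Write back as plain UTF-8 (no BOM), so nothing is left to remove.
output = os.path.join(tempfile.mkdtemp(), "my_data_converted.csv")
my_data.to_csv(output, index=False, encoding="utf-8")
with open(output, "rb") as f:
    assert not f.read().startswith(b"\xef\xbb\xbf")
```

If the file must stay friendly to Excel on Windows, writing with encoding='utf-8-sig' is fine too; the key point is that a single BOM only results when the data itself no longer contains one.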
For me, the BOM was prepended to the first column name. Fortunately, pandas was still able to read the file into a dataframe, with the BOM attached to that column name. I iterate over ALL columns to remove the BOM from the first one (since I deal with many different CSV file sources, I can't be sure what the first column is called):
import re

# Remove the byte order mark (and any other unexpected characters)
# from the beginning of the first column name.
for column in df.columns:
    new_column_name = re.sub(r"[^0-9a-zA-Z.,\-/_ ]", "", column)
    df.rename(columns={column: new_column_name}, inplace=True)
Hope this helps someone.
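If only the BOM needs to go and every other character in the column names must be preserved, a narrower sketch is to strip the '\ufeff' prefix directly (the dataframe here is hypothetical):

```python
import pandas as pd

# Hypothetical dataframe whose first column name carries a BOM:
df = pd.DataFrame({"\ufeffid": [1, 2], "value": ["a", "b"]})

# Remove only a leading BOM; all other characters are left untouched.
df.columns = [c.lstrip("\ufeff") for c in df.columns]
assert list(df.columns) == ["id", "value"]
```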