Replace function in dataframe removes more than expected
Question:
Suppose I have the following csv file:
Column1;Column2
AB; Test
Which has a NUL byte character in it:
I want to read it in with pandas, manipulate it and export it again. I want to remove NUL byte characters, my code is as follows:
import pandas as pd
df=pd.read_csv(r'C:foldertest.csv', sep=";", encoding="utf-8")
df=df.replace('x00','', regex=True)
df.to_csv(r'C:foldertestoutput.csv', sep=";", index=False, encoding="utf-8")
I expect that only the NUL byte character is removed, however, more is removed:
So it removes the whole "cell", also the "Test". But I don’t want this.
Why, where is my mistake? How can I only remove the NUL byte character?
Answers:
The issue is not coming from the replace
, but rather from the fact that the NULL character is handled as a line terminator by read_csv
.
What you can do it manually remove the NULL before passing it to read_csv
:
import io
f = io.StringIO(open(r'C:foldertest.csv').read().replace('x00', ''))
df = pd.read_csv(f, sep=";", encoding="utf-8")
Output:
Column1 Column2
0 AB Test
Alternatively, if you just want to remove the NULL without pandas:
with open(r'C:foldertest.csv', 'r') as f1, open(r'C:foldertest_clean.csv', 'w') as f2:
for line in f1:
f2.write(line.replace('x00', ''))
Suppose I have the following csv file:
Column1;Column2
AB; Test
Which has a NUL byte character in it:
I want to read it in with pandas, manipulate it and export it again. I want to remove NUL byte characters, my code is as follows:
import pandas as pd
df=pd.read_csv(r'C:foldertest.csv', sep=";", encoding="utf-8")
df=df.replace('x00','', regex=True)
df.to_csv(r'C:foldertestoutput.csv', sep=";", index=False, encoding="utf-8")
I expect that only the NUL byte character is removed, however, more is removed:
So it removes the whole "cell", also the "Test". But I don’t want this.
Why, where is my mistake? How can I only remove the NUL byte character?
The issue is not coming from the replace
, but rather from the fact that the NULL character is handled as a line terminator by read_csv
.
What you can do it manually remove the NULL before passing it to read_csv
:
import io
f = io.StringIO(open(r'C:foldertest.csv').read().replace('x00', ''))
df = pd.read_csv(f, sep=";", encoding="utf-8")
Output:
Column1 Column2
0 AB Test
Alternatively, if you just want to remove the NULL without pandas:
with open(r'C:foldertest.csv', 'r') as f1, open(r'C:foldertest_clean.csv', 'w') as f2:
for line in f1:
f2.write(line.replace('x00', ''))