Pandas escape carriage return in to_csv
Question:
I have a string column that sometimes has carriage returns in the string:
import pandas as pd
from io import StringIO
datastring = StringIO("""
country  metric  2011  2012
USA      GDP     7     4
USA      Pop.    2     3
GB       GDP     8     7
""")
# columns are separated by two or more whitespace characters
df = pd.read_table(datastring, sep=r'\s\s+', engine='python')
df.metric = df.metric + '\r' # append carriage return
print(df)
  country  metric  2011  2012
0     USA   GDP\r     7     4
1     USA  Pop.\r     2     3
2      GB   GDP\r     8     7
When writing to and reading from csv, the dataframe gets corrupted:
df.to_csv('data.csv', index=None)
print(pd.read_csv('data.csv'))
  country metric  2011  2012
0     USA    GDP   NaN   NaN
1     NaN      7     4   NaN
2     USA   Pop.   NaN   NaN
3     NaN      2     3   NaN
4      GB    GDP   NaN   NaN
5     NaN      8     7   NaN
Question
What’s the best way to fix this? The one obvious method is to just clean the data first:
df.metric = df.metric.str.replace('\r', '')
Answers:
Specify the line_terminator:
print(pd.read_csv('data.csv', line_terminator='\n'))
  country  metric  2011  2012
0     USA   GDP\r     7     4
1     USA  Pop.\r     2     3
2      GB   GDP\r     8     7
UPDATE:
In more recent versions of pandas (the original answer is from 2015), the name of the argument changed to lineterminator.
To anyone else who is dealing with this issue:
@mike-müller’s answer doesn’t actually fix the problem: the file is still corrupted when it is read by other CSV readers (e.g. Excel). You need to fix this when you write the file rather than when you read it.
The problem is that strings containing newline characters (\r, \n, or \r\n, depending on the OS style) are written without quotes. Nothing then stops a CSV reader (pandas, Excel, etc.) from treating those characters as record separators, so each unquoted record is split across multiple lines when the file is loaded.
The generalized newline sequence in Python is \r\n, in the sense that you strip a string with these characters, e.g. str.strip('\r\n'), which removes any leading or trailing \r or \n. This makes Python recognize and cover all OS newline styles.
In pandas, writing the CSV file with line_terminator='\r\n' wraps every string containing either \n or \r in double quotes, which keeps readers from parsing those newline characters as line breaks later.
Just to provide the code:
df.to_csv('data.csv', line_terminator='\r\n') # named lineterminator in pandas 1.5 and later
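To see the quoting effect in isolation, here is a minimal sketch using only Python's standard csv module, which, as far as I know, is what to_csv uses to write rows. The sample row and the variable names are made up for illustration. It writes one record with '\r\n' as the record terminator and reads it back:

```python
import csv
import io

# Hypothetical sample record; the second field ends with a carriage return.
row = ["USA", "GDP\r", 7, 4]

# Write the record with '\r\n' as the record terminator (default minimal quoting).
buf = io.StringIO()
csv.writer(buf, lineterminator="\r\n").writerow(row)
text = buf.getvalue()
print(repr(text))  # the field holding '\r' is wrapped in double quotes

# Read it back. newline="" keeps StringIO from translating line endings.
parsed = next(csv.reader(io.StringIO(text, newline="")))
print(parsed)  # the carriage return is still part of the second field
```

Because the field is quoted, the reader treats the embedded \r as data rather than as the end of the record. Note that the csv reader returns every value as a string, so the numbers come back as '7' and '4'.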
In my case, applying quoting=csv.QUOTE_ALL solved the issue.
import csv
df.to_csv('some_data.csv', quoting=csv.QUOTE_ALL)
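For anyone who wants to check the round trip, here is a small sketch under a few assumptions: the frame is rebuilt by hand from the question's data, and the CSV is kept in memory (to_csv with no path returns the text as a string) instead of being written to a file.

```python
import csv
import io

import pandas as pd

# Same data as in the question, built directly for brevity.
df = pd.DataFrame({
    "country": ["USA", "USA", "GB"],
    "metric": ["GDP\r", "Pop.\r", "GDP\r"],
    "2011": [7, 2, 8],
    "2012": [4, 3, 7],
})

# Quote every field, so the embedded '\r' always sits inside double quotes.
text = df.to_csv(index=False, quoting=csv.QUOTE_ALL)

# Read the CSV text back and compare the string column with the original.
restored = pd.read_csv(io.StringIO(text))
print(restored["metric"].tolist() == df["metric"].tolist())
```

QUOTE_ALL also quotes the numbers and the header, which makes the file a bit larger, but it does not depend on which characters the writer decides need quoting.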