How to escape the escapechar in pandas to_csv

Question:

I’m trying to write dataframes to CSV. A lot of the incoming data is user-generated and may contain special characters. I can set escapechar='\\' (for example), but then a backslash in the data gets written as "\", which is interpreted as an escaped double quote rather than a string containing a backslash. How can I escape the escapechar (i.e., how can I have to_csv write \\ by escaping the backslash)?

Example code:

import pandas as pd
import io, csv

data = [[1, "\\", "text"]]
df = pd.DataFrame(data)

sIo = io.StringIO()
df.to_csv(
    sIo,
    index=False,
    sep=',',
    header=False,
    quoting=csv.QUOTE_MINIMAL,
    doublequote=False,
    escapechar='\\'
)
sioText = sIo.getvalue()
print(sioText)

Actual output:

1,"\",text

What I need:

1,"\\",text

The engineering use case that creates the constraints is that this will be some core code for moving data from one system to another. I won’t know the format of the data in advance and won’t have much control over it (any column could contain the escape character), and I can’t control the escape character on the other side so the actual output will be read as an error. Hence the original question of "how do you escape the escape character."
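
To make the target format concrete, here is a minimal sketch of how a consumer would parse the desired line. It assumes the receiving side behaves like Python's csv.reader configured with the same escapechar and doublequote settings, which may not match the real downstream system.

```python
import csv

# The desired CSV line, written as a raw string: 1,"\\",text
desired = r'1,"\\",text'

# Assumption: the consumer uses backslash as the escape character
# and does not treat "" as an escaped quote.
reader = csv.reader([desired], escapechar="\\", doublequote=False)
row = next(reader)

# The second field decodes to a single backslash.
print(row)
```

With the backslash escaped, the quoted field closes where expected and the row comes back as three fields.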

For reference this parameter’s definition in the pandas docs is:

escapechar : str, default None
String of length 1. Character used to escape sep and quotechar when appropriate.
Asked By: dudemonkey

||

Answers:

Huh. This seems like an open issue with round-tripping data from pandas to csv. See this issue: https://github.com/pandas-dev/pandas/issues/14122, and especially pandas creator Wes McKinney’s post:

This behavior is present in the csv module https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4 . From first principles seems like the offending backslash should be escaped. If I manually edit the file to be

"a"
"Hello! Please \"help\" me. I cannot quote a csv.\\"

then read_csv returns the original input

I fiddled with R and it doesn’t seem to do much better

> df <- data.frame(a=c("Hello! Please \"help\" me. I cannot quote a csv.\\"))
> write.table(df, sep=',', qmethod='e', row.names=F)
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\"

Another example of CSV not being a high fidelity data interchange tool =|

I’m as baffled as you that this doesn’t work, but it seems like the official position is… df[col] = df[col].str.replace("\\", "\\\\", regex=False)?

Answered By: Michael Delgado

I solved this by using Pandas’ regex replacer:

df = df.replace('\\\\', '\\\\\\\\', regex=True)

We need four slashes per final slash because there are two layers of escaping: one for the Python string literal and one for the regular expression. This will find and replace any \s in any column of the data frame, anywhere they appear in a string.
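
As an alternative that avoids the regex layer of escaping, the same doubling can be sketched with plain str.replace applied cell by cell. The helper name double_backslashes is hypothetical, and the sketch assumes unique column names and that only Python str cells need escaping.

```python
import pandas as pd


def double_backslashes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every backslash in string cells doubled."""
    out = df.copy()
    for col in out.columns:
        # Only touch cells that are Python strings; numbers, None, etc.
        # pass through unchanged.
        out[col] = out[col].map(
            lambda v: v.replace("\\", "\\\\") if isinstance(v, str) else v
        )
    return out


df = pd.DataFrame([[1, "\\", "a\\b"]])
escaped = double_backslashes(df)

# Each single backslash in the input is now two backslashes.
print(escaped.iloc[0, 1], escaped.iloc[0, 2])
```

Here each Python literal needs only two slashes per final slash, since no regular expression is involved. It may be worth confirming on your own Python and pandas versions that to_csv still writes the backslash unescaped before applying this, since doubling a backslash that the writer also escapes would corrupt the data.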

It is mind-boggling to me that this is still the default behavior.

Answered By: garrettmills