How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
Question:
I need to create a data frame by reading in data from a file, using the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).
Is there a way to tell pandas to treat these files properly?
By the way, I do not have this problem if I use plain Python. I use:
for line in open(file_name):
    fld = line.split()
And it works perfectly. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
Answers:
From the documentation, you can use either a regex or delim_whitespace:
>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
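To try this without creating a file, here is a minimal sketch that feeds an in-memory string to read_csv. The sample text in `raw` is made up for illustration. Note that recent pandas releases deprecate delim_whitespace in favour of sep=r"\s+", so the regex form is probably the safer one to rely on; check this against your installed version.

```python
import io

import pandas as pd

# Hypothetical sample data: fields separated by an irregular mix of tabs and spaces.
raw = "a\t b\tc 1 2\nd\t   e \t f 3   4\n"

# sep=r"\s+" treats every run of whitespace (spaces, tabs, or both) as one delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

print(df.shape)
print(df)
```

Both rows should come out with five columns, the same way line.split() would divide them.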
>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")
This uses any combination of any number of spaces and tabs as the separator.
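Since \s already matches tabs as well as spaces, the longer alternation should behave the same as plain \s+. A quick way to convince yourself is to compare the two patterns with the standard re module; the sample `line` below is invented for the check.

```python
import re

# Hypothetical line mixing single spaces, multiple spaces and tabs.
line = "a\t b\tc 1   2 \t\t 3"

# Plain \s+ versus the longer alternation of space/tab combinations.
simple = re.split(r"\s+", line)
verbose = re.split(r"\s+|\t+|\s+\t+|\t+\s+", line)

print(simple)
print(simple == verbose)
print(simple == line.split())
```

So for whitespace-only separators, sep=r"\s+" is enough. One practical difference: pandas can handle r"\s+" with its fast C parser, while other regular expressions make it fall back to the slower Python engine.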
For comma-separated data, the following pattern also absorbs zero or more spaces or tabs on either side of each comma:
pd.read_csv("whitespace.csv", header=None, sep=r"[ \t]*,[ \t]*")
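As a rough illustration of the comma pattern, here is a small sketch on in-memory data. The contents of `raw` are invented, and engine="python" is passed explicitly because regex separators other than \s+ are handled by the Python parsing engine (pandas would otherwise fall back to it with a warning).

```python
import io

import pandas as pd

# Hypothetical comma-separated data with stray spaces and tabs around the commas.
raw = "a , b,\tc\nd,e \t, f\n"

# The separator is a comma plus any spaces/tabs directly before or after it.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r"[ \t]*,[ \t]*",
    engine="python",
)

print(df.values.tolist())
```

Keep in mind that regex separators tend to ignore quoting, so a comma inside a quoted field may still be split.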
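Pandas has two CSV readers, and only one of them is flexible regarding redundant leading whitespace:
pd.read_csv("whitespace.csv", skipinitialspace=True)
while the other is not (note that DataFrame.from_csv was later deprecated and removed in pandas 1.0):
pd.DataFrame.from_csv("whitespace.csv")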
Neither is flexible out of the box regarding trailing whitespace; see the answers that use regular expressions. Avoid delim_whitespace, as it also allows plain spaces (without , or \t) as separators.
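To make the leading versus trailing distinction concrete, here is a small sketch with made-up data. The column names x and y are only for illustration, and the final .str.strip() call is one possible clean-up step, not something the reader does for you.

```python
import io

import pandas as pd

# Hypothetical CSV: one value has leading spaces, another has trailing spaces.
raw = "x,y\n1,   foo\n2,bar   \n"

# skipinitialspace=True drops spaces that directly follow the delimiter.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(df["y"].tolist())

# Trailing spaces survive the read, so strip them afterwards if needed.
stripped = df["y"].str.strip().tolist()
print(stripped)
```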