Auto-detect the delimiter in a CSV file using pd.read_csv
Question:
Is there a way for read_csv to auto-detect the delimiter? numpy’s genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas’ read_csv.
Any ideas?
Answers:
Option 1
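Use delim_whitespace=True, which treats any run of spaces or tabs as a single delimiter. Note that pandas 2.2 and later deprecate this flag in favor of the regex separator \s+.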
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
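Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep=r'\s+')
This is equivalent to the first option: \s+ matches one or more whitespace characters, so single spaces, double spaces and tabs are all treated as one delimiter.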
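As a quick check of the whitespace pattern (note the backslash: the regex is \s+), here is a small sketch that parses an in-memory sample mixing single spaces, double spaces and tabs. The sample data and column names are made up for illustration; only read_csv and its sep parameter come from the answer above.

```python
import io

import pandas as pd

# Hypothetical sample: fields separated by a single space, a double space, or a tab
sample = "a b  c\n1\t2  3\n4 5\t6\n"

# r"\s+" is a regex meaning "one or more whitespace characters"
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(df)
```

All three rows should split into the same three columns, regardless of which whitespace separator each line uses.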
For better control, I use a Python module called detect_delimiter, available on PyPI: https://pypi.org/project/detect-delimiter/ . It has been around for some time. As with all code, you should test it with your interpreter prior to deployment. I have tested up to Python 3.8.5.
In the code example below, the delimiter is detected automatically from the first line of the file and stored in the variable delimiter, which is then passed to read_csv as sep. I have tested with the following delimiters, although others should work: ; , |
It does not work with multi-character delimiters, such as ","
CAUTION! This method does nothing to detect a malformed CSV file. If the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd

csv_path = 'security_rule_file.csv'
# Read only the first line and let detect() guess the delimiter from it
with open(csv_path) as myfile:
    firstline = myfile.readline()
delimiter = detect(firstline)
# No explicit close() is needed: the with block closes the file
records = pd.read_csv(csv_path, sep=delimiter)
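Another option is to use the built-in csv.Sniffer. I read only a limited number of characters (read() in text mode counts characters, not bytes) so that a large CSV file is not loaded in full.
import csv

def get_delimiter(file_path, num_chars=4096):
    # Sniff only the first num_chars characters of the file.
    # csv.Sniffer raises csv.Error if it cannot determine a delimiter.
    with open(file_path, "r", newline="") as f:
        data = f.read(num_chars)
    return csv.Sniffer().sniff(data).delimiter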
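The Sniffer approach can be made a little more defensive. The sketch below is one possible variant, not the answerer's code: the function name sniff_delimiter, the candidates and default parameters, and the demo file contents are all hypothetical. It restricts the sniffer to a known set of candidate delimiters and falls back to a default when csv.Sniffer raises csv.Error.

```python
import csv
import os
import tempfile


def sniff_delimiter(file_path, num_chars=4096, candidates=";,|\t", default=","):
    """Guess the delimiter with csv.Sniffer; return `default` if sniffing fails."""
    with open(file_path, "r", newline="") as f:
        sample = f.read(num_chars)  # only the start of the file is inspected
    try:
        # `delimiters` limits the guess to the characters we consider valid
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default


# Demo with a made-up semicolon-separated file
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name;port;action\nweb;443;allow\nssh;22;deny\n")
    demo_path = tmp.name

detected = sniff_delimiter(demo_path)
os.remove(demo_path)
print(repr(detected))
```

The returned value can be passed to pd.read_csv as sep. The pandas documentation also describes sep=None together with engine='python', in which case read_csv runs csv.Sniffer itself; that may be enough when no fallback logic is needed.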