Pandas read_csv() conditionally skipping header row
Question:
I’m trying to read CSV files, but they don’t all share the same format. I’d like to add checks so that I don’t need to edit my code or my input files.
My problem is that some of these CSV files have a line of text above the column headers. An example:
Created on 12-11-2018,CryptoDataDownload.com
Date,Symbol,Open,High,Low,Close,Volume From,Volume To
2018-12-11 11-AM,ADABTC,8.6e-06,8.61e-06,8.55e-06,8.57e-06,301141.7,2.59
2018-12-11 10-AM,ADABTC,8.69e-06,8.72e-06,8.6e-06,8.6e-06,236949.63,2.05
If I import this, the parser treats the first line as the header and splits the file into two columns, Created on 12-11-2018 and CryptoDataDownload.com.
This is what df.head() looks like:
Created on 12-11-2018 CryptoDataDownload.com
Date Symbol Open High Low Close Volume From Volume To
2018-12-11 11-AM ADABTC 8.6e-06 8.61e-06 8.55e-06 8.57e-06 301141.7 2.59
2018-12-11 10-AM ADABTC 8.69e-06 8.72e-06 8.6e-06 8.6e-06 236949.63 2.05
2018-12-11 09-AM ADABTC 8.7e-06 8.7e-06 8.62e-06 8.69e-06 509311.39 4.41
2018-12-11 08-AM ADABTC 8.69e-06 8.7e-06 8.63e-06 8.7e-06 111367.34 0.9656
I want to check if this file has this line and skip it if so.
How can I do this?
Answers:
If the headers in your CSV files follow a similar pattern, you can do something simple like sniffing out the first line before determining whether to skip the first row or not.
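import pandas as pd

filename = '/path/to/file.csv'
# True/False becomes 1/0: skip one row only if the first line is the "Created on" banner
skiprows = int('Created on' in next(open(filename)))
df = pd.read_csv(filename, skiprows=skiprows)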
Good practice would be to use a context manager so the file is closed after the check, so you could also do this:
import pandas as pd

filename = '/path/to/file.csv'
skiprows = 0
with open(filename, 'r') as f:
    # Only the first line can be the banner, so check that line alone
    if f.readline().startswith('Created '):
        skiprows = 1
df = pd.read_csv(filename, skiprows=skiprows)
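If some files carry more than one line above the header, the same sniffing idea can look for the header row instead of the banner. The following is a minimal sketch using only the standard library; the function name rows_before_header and the parameters header_prefix and max_scan are hypothetical, and it assumes the real header always starts with "Date," as in the sample file.

```python
from itertools import islice


def rows_before_header(path, header_prefix="Date,", max_scan=10):
    """Return how many lines precede the first line starting with header_prefix.

    Returns 0 if no such line is found within the first max_scan lines.
    """
    # "utf-8-sig" also strips a leading byte-order mark if the file has one
    with open(path, "r", encoding="utf-8-sig") as f:
        for index, line in enumerate(islice(f, max_scan)):
            if line.startswith(header_prefix):
                return index
    return 0
```

The returned count can then be passed to pandas as pd.read_csv(filename, skiprows=rows_before_header(filename)). Falling back to 0 when no header is found means an unexpected file is read as-is rather than silently losing rows.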
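You can skip lines that start with a specific character by using the comment argument of pandas read_csv. Be aware that comment also cuts off the rest of any line in which that character appears, so it only works when the character never occurs in the data. With comment="C" the banner line is skipped, but in the sample file above the header would also be cut at "Close" and every row at "ADABTC", so this is only safe if the banner begins with a character that appears nowhere else:
import pandas as pd

filename = '/path/to/file.csv'
pd.read_csv(filename, comment="C")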
This works for me. It downloads the file and always drops the first line:
import os
import requests

CSV_URL = '...'
with open(os.path.split(CSV_URL)[1], 'wb') as f, requests.get(CSV_URL, stream=True) as r:
    for i, line in enumerate(r.iter_lines()):
        if i > 0:  # skip the first line
            f.write(line + b'\n')
For your case:
import os
import requests

CSV_URL = '...'
with open(os.path.split(CSV_URL)[1], 'wb') as f, requests.get(CSV_URL, stream=True) as r:
    for line in r.iter_lines():
        # iter_lines() yields bytes, so compare against a bytes prefix
        if not line.startswith(b'Created on '):
            f.write(line + b'\n')
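The filtering step can be separated from the download so it is easy to check without a network call. This is a rough sketch, not part of the answer above: drop_banner is a hypothetical helper that takes any iterable of bytes lines (such as the one produced by r.iter_lines()) and drops only a first line that starts with the banner prefix.

```python
def drop_banner(lines, prefix=b"Created on "):
    """Yield lines unchanged, except a first line that starts with prefix."""
    iterator = iter(lines)
    first = next(iterator, None)
    if first is None:
        return  # empty input: nothing to yield
    if not first.startswith(prefix):
        yield first
    # Every later line is passed through, even if it happens to match prefix
    yield from iterator
```

Unlike the loop that tests every line, this version only ever removes the first line, so a data row that happens to begin with the same text is kept.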
Adapted from: stackoverflow