How to read a CSV file from a URL with Python?
Question:
when I do curl to a API call link http://example.com/passkey=wedsmdjsjmdd
curl 'http://example.com/passkey=wedsmdjsjmdd'
I get the employee output data on a csv file format, like:
"Steve","421","0","421","2","","","","","","","","","421","0","421","2"
how can parse through this using python.
I tried:
import csv
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd',"rb"))
for row in cr:
print row
but it didn’t work and I got an error
http://example.com/passkey=wedsmdjsjmdd No such file or directory:
Thanks!
Answers:
You need to replace open
with urllib.urlopen or urllib2.urlopen.
e.g.
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
print row
This would output the following
Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.
Using pandas it is very simple to read a csv file directly from a url
import pandas as pd
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process
You could do it with the requests module as well:
url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
To increase performance when downloading a large file, the below may work a bit more efficiently:
import requests
from contextlib import closing
import csv
url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
for row in reader:
# Handle each row here...
print row
By setting stream=True
in the GET request, when we pass r.iter_lines()
to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader
.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
what you were trying to do with the curl command was to download the file to your local hard drive(HD). You however need to specify a path on HD
curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv',"r"))
for row in cr:
print row
This question is tagged python-2.x
so it didn’t seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here’s an updated Python 3 solution.
It’s now necessary to decode urlopen
‘s response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv, urllib.request
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)
for row in cr:
print(row)
Note the extra line beginning with lines =
, the fact that urlopen
is now in the urllib.request
module, and print
of course requires parentheses.
It’s hardly advertised, but yes, csv.reader
can read from a list of strings.
And since someone else mentioned pandas, here’s a pandas rendition that displays the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
Pandas is not a lightweight library, though. If you don’t need the things that pandas provides, or if startup time is important (e.g. you’re writing a command line utility or any other program that needs to load quickly), I’d advise that you stick with the standard library functions.
I am also using this approach for csv files (Python 3.6.9):
import csv
import io
import requests
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
print(row)
All the above solutions didn’t work with Python3, I got all the "famous" error messages, like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
and _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
. So I was a bit stuck here.
My mistake here, was that I used response.text
while response
is a requests.models.Response
class, while I should have used response.content
instead (as the first error suggested), so I was able to decode its UTF-8 correctly and split lines afterwards. So here is my solutions:
response = reqto.get("https://example.org/utf8-data.csv")
# Do some error checks to avoid bad results
if response.ok and len(response.content) > 0:
reader = csv.DictReader(response.content.decode('utf-8').splitlines(), dialect='unix')
for row in reader:
print(f"DEBUG: row={row}")
The above example gives me already a dict
back with each row. But with leading #
for each dict key, which I may have to live with.
when I do curl to a API call link http://example.com/passkey=wedsmdjsjmdd
curl 'http://example.com/passkey=wedsmdjsjmdd'
I get the employee output data on a csv file format, like:
"Steve","421","0","421","2","","","","","","","","","421","0","421","2"
how can parse through this using python.
I tried:
import csv
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd',"rb"))
for row in cr:
print row
but it didn’t work and I got an error
http://example.com/passkey=wedsmdjsjmdd No such file or directory:
Thanks!
You need to replace open
with urllib.urlopen or urllib2.urlopen.
e.g.
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
print row
This would output the following
Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.
Using pandas it is very simple to read a csv file directly from a url
import pandas as pd
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process
You could do it with the requests module as well:
url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
To increase performance when downloading a large file, the below may work a bit more efficiently:
import requests
from contextlib import closing
import csv
url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
for row in reader:
# Handle each row here...
print row
By setting stream=True
in the GET request, when we pass r.iter_lines()
to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader
.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
what you were trying to do with the curl command was to download the file to your local hard drive(HD). You however need to specify a path on HD
curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv',"r"))
for row in cr:
print row
This question is tagged python-2.x
so it didn’t seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here’s an updated Python 3 solution.
It’s now necessary to decode urlopen
‘s response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv, urllib.request
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)
for row in cr:
print(row)
Note the extra line beginning with lines =
, the fact that urlopen
is now in the urllib.request
module, and print
of course requires parentheses.
It’s hardly advertised, but yes, csv.reader
can read from a list of strings.
And since someone else mentioned pandas, here’s a pandas rendition that displays the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
Pandas is not a lightweight library, though. If you don’t need the things that pandas provides, or if startup time is important (e.g. you’re writing a command line utility or any other program that needs to load quickly), I’d advise that you stick with the standard library functions.
I am also using this approach for csv files (Python 3.6.9):
import csv
import io
import requests
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
print(row)
All the above solutions didn’t work with Python3, I got all the "famous" error messages, like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
and _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
. So I was a bit stuck here.
My mistake here, was that I used response.text
while response
is a requests.models.Response
class, while I should have used response.content
instead (as the first error suggested), so I was able to decode its UTF-8 correctly and split lines afterwards. So here is my solutions:
response = reqto.get("https://example.org/utf8-data.csv")
# Do some error checks to avoid bad results
if response.ok and len(response.content) > 0:
reader = csv.DictReader(response.content.decode('utf-8').splitlines(), dialect='unix')
for row in reader:
print(f"DEBUG: row={row}")
The above example gives me already a dict
back with each row. But with leading #
for each dict key, which I may have to live with.