Python basics – request data from API and write to a file
Question:
I am trying to use the “requests” package to retrieve info from GitHub, as the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
I have to say I don’t understand the second code block.
- filename – in what form do I provide the path to the file, if it is created? Where will it be saved if I don’t?
- ‘wb’ – what is this variable? (Shouldn’t the second parameter be ‘mode’?)
- the following two lines probably iterate over the data retrieved with the request and write it to the file
The Python docs explanation is also not helping much.
EDIT: What I am trying to do:
- use Requests to connect to an API (GitHub and later the Facebook Graph API)
- retrieve data into a variable
- write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Answers:
filename is a string containing the path where you want to save the file. It accepts either a relative or an absolute path, so you can just have filename = 'example.html'.
'wb' stands for write and bytes: the file is opened for writing in binary mode. The modes are described in the Python documentation for open.
The for loop goes over the entire returned content (in chunks, in case it is too large to hold in memory at once) and writes each chunk until there are no more. This is useful for large files, but for a single webpage you could just do:
# just 'w' because we are writing text (r.text), not bytes (r.content)
with open(filename, 'w') as fd:
    fd.write(r.text)
Filename
When using open, a relative path is resolved against your current working directory. So open('file.txt', 'w') would create a new file named file.txt in whatever directory you ran Python from, which is often, but not always, the folder your script is in. You can also specify an absolute path, for example /home/user/file.txt on Linux. If a file named file.txt already exists, its contents will be completely overwritten.
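If you want to check where a relative filename actually ends up, a small sketch like the following may help. It uses only the standard library; the temporary directory and the helper names resolve and demo are assumptions made to keep the example self-contained, and file.txt is the name from the explanation above.

```python
import os
import tempfile


def resolve(filename):
    """Return the absolute path that open(filename) would use right now."""
    # Relative paths are resolved against the current working directory,
    # which is not necessarily the folder containing the script.
    return os.path.abspath(filename)


def demo():
    """Create file.txt via a relative path and report where it landed."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        os.chdir(workdir)  # pretend Python was started from this directory
        try:
            with open('file.txt', 'w') as fd:  # relative path
                fd.write('hello')
            created = resolve('file.txt')
            exists = os.path.exists(os.path.join(workdir, 'file.txt'))
            return created, exists
        finally:
            os.chdir(original_cwd)  # leave before the directory is removed
```

Calling demo() returns the absolute path of the file that was created and whether it exists inside the working directory, which shows that the working directory, not the script location, decides where a relative path points.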
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means binary (bytes). You use 'w' when you want to write to (rather than read from) a file, and you add 'b' for binary files (rather than text files). It may look a little odd to use 'b' here, since the content you are downloading is text, but iter_content yields bytes, so in Python 3 the loop needs 'wb'. Plain 'w' works if you write the decoded text (r.text) instead. Read more on the modes in the docs for open.
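The difference between the two modes can be tried out without any network access. This is a minimal sketch assuming Python 3; the function name write_both_ways and the file names are made up for illustration.

```python
import os
import tempfile


def write_both_ways(directory):
    """Write the same content in text mode and in binary mode."""
    text = 'caf\u00e9'            # a str
    data = text.encode('utf-8')   # the same content as bytes

    text_path = os.path.join(directory, 'text.txt')
    bytes_path = os.path.join(directory, 'bytes.bin')

    # Text mode ('w'): write str, Python encodes it for you.
    with open(text_path, 'w', encoding='utf-8') as fd:
        fd.write(text)

    # Binary mode ('wb'): write bytes exactly as given.
    with open(bytes_path, 'wb') as fd:
        fd.write(data)

    # Mixing them up fails in Python 3: text mode rejects bytes.
    try:
        with open(text_path, 'w', encoding='utf-8') as fd:
            fd.write(data)
        mixed_ok = True
    except TypeError:
        mixed_ok = False

    with open(bytes_path, 'rb') as fd:
        return fd.read(), mixed_ok
```

With requests, r.content plays the role of data (bytes) and r.text plays the role of text (str), so pick the mode that matches what you pass to write.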
The Loop
This part uses the iter_content method from requests, which is intended for large files that you may not want in memory all at once. It is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
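For the large-file case, the loop could be wrapped up roughly as follows. This is a sketch, not the only way to do it: the helper names save_chunks and download and the default chunk_size of 8192 are assumptions, while stream=True, raise_for_status and iter_content are part of requests.

```python
def save_chunks(chunks, filename):
    """Write an iterable of bytes chunks to filename; return bytes written."""
    total = 0
    with open(filename, 'wb') as fd:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fd.write(chunk)
                total += len(chunk)
    return total


def download(url, filename, chunk_size=8192):
    """Stream url to filename without holding the whole body in memory."""
    import requests  # imported here so save_chunks works without requests

    # stream=True defers downloading the body until iter_content is consumed.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # stop early on 4xx/5xx responses
        return save_chunks(r.iter_content(chunk_size=chunk_size), filename)
```

For example, download('https://api.github.com/events', 'events.json') would save the response to events.json in the current working directory. Without stream=True, requests reads the whole body into memory first, and the chunked loop only splits up data that is already loaded.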
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt', 'w') as fd:
    fd.write(r.text)
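Since the stated goal is to get the data into a variable first and then write it out, and the GitHub API returns JSON, the standard json module may be a better fit than writing raw text. A minimal sketch, where save_json_text is a hypothetical helper name:

```python
import json


def save_json_text(json_text, filename):
    """Parse JSON text into Python objects, then write them to filename."""
    # With requests, r.json() performs this parsing step on the response.
    data = json.loads(json_text)
    with open(filename, 'w', encoding='utf-8') as fd:
        json.dump(data, fd)
    return data
```

With the response from above you could call events = save_json_text(r.text, 'events.json'); events is then an ordinary Python list or dict that you can inspect before storing it anywhere else.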
I have a response in JSON format. I would like to write it to a JSON file.
with open('/dbfs/tmp/response.json', 'w') as fd:
    fd.write(r.text)
Then, I want to read this data into a dataframe. It is being read as a corrupt record.
How do I read it into a dataframe nicely?
df = spark.read.format('org.apache.spark.sql.json').load("/tmp/response.json")
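The thread does not include an answer to this, so the following is only a guess at the cause: by default Spark's JSON reader expects one JSON document per line (JSON Lines), and a file whose JSON spans several lines tends to show up as a corrupt record. Under that assumption, one option is to rewrite the response as one object per line before saving it. The helper name to_json_lines is hypothetical.

```python
import json


def to_json_lines(json_text):
    """Convert JSON text (an array or a single object) to JSON Lines."""
    data = json.loads(json_text)
    # Treat a single object as a one-element list of records.
    records = data if isinstance(data, list) else [data]
    # One compact JSON document per line, with a trailing newline.
    return '\n'.join(json.dumps(record) for record in records) + '\n'
```

You would then write to_json_lines(r.text) to the file instead of r.text. Alternatively, Spark's JSON reader has a multiLine option (spark.read.option("multiLine", True)) that reads a file containing one multi-line JSON document; whether either approach fixes this particular case depends on what the response actually looks like.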