Scraping a JSON response with Scrapy
Question:
How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}
I want to scrape specific items (e.g. name and fax in the above) and save them to CSV.
Answers:
It’s the same as using Scrapy’s HtmlXPathSelector for HTML responses. The only difference is that you use the json module to parse the response:
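import json

import scrapy

class MySpider(scrapy.Spider):  # BaseSpider is the old name for scrapy.Spider
    ...

    def parse(self, response):
        # response.text is the decoded body; parse it as JSON
        jsonresponse = json.loads(response.text)
        item = MyItem()
        item["firstName"] = jsonresponse["firstName"]
        return item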
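The question asks for the name and the fax number, which means reaching into the nested phoneNumber list. A minimal sketch of that lookup on the parsed dictionary follows; the helper name extract_name_and_fax and the output keys "name" and "fax" are assumptions made for illustration, not part of Scrapy. In a spider, the same function could be called on json.loads(response.text).

```python
import json

# Sample payload from the question, trimmed to the fields used here.
SAMPLE = """
{
  "firstName": "John",
  "lastName": "Smith",
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ]
}
"""


def extract_name_and_fax(data):
    """Return the full name and the first fax number ("" if there is none)."""
    name = f'{data.get("firstName", "")} {data.get("lastName", "")}'.strip()
    # Scan the phone entries and take the first one whose type is "fax".
    fax = next(
        (
            phone.get("number", "")
            for phone in data.get("phoneNumber", [])
            if phone.get("type") == "fax"
        ),
        "",
    )
    return {"name": name, "fax": fax}


row = extract_name_and_fax(json.loads(SAMPLE))
print(row)  # {'name': 'John Smith', 'fax': '646 555-4567'}
```

Using .get() with defaults keeps the helper from raising KeyError when a record has no phone numbers or is missing a name field.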
One possible reason the JSON fails to load is that its strings are wrapped in single quotes, which JSON does not allow. Try this:
json.loads(response.text.replace("'", '"'))
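Blindly swapping quote characters breaks as soon as a value contains an apostrophe. If the body is really a Python-style literal (single-quoted strings), one alternative is to fall back to ast.literal_eval from the standard library. The sketch below assumes that situation; loads_lenient is a hypothetical helper name, and the fallback will not accept JSON's true, false, or null.

```python
import ast
import json


def loads_lenient(text):
    """Parse JSON; fall back to Python-literal syntax for single-quoted data."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals (it does not execute code),
        # but it rejects JSON's true/false/null.
        return ast.literal_eval(text)


body = "{'lastName': \"O'Brien\", 'age': 25}"

# The naive replacement turns the apostrophe into a stray double quote.
naive = body.replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

parsed = loads_lenient(body)
print(naive_ok, parsed)
```

If the server sends valid JSON, the first branch handles it and the fallback is never reached.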
You don’t need the json module to parse the response object. In Scrapy 2.2 and later, the response has its own json() method:
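import scrapy

class MySpider(scrapy.Spider):  # BaseSpider is the old name for scrapy.Spider
    ...

    def parse(self, response):
        # response.json() decodes the body and parses it in one step
        jsonresponse = response.json()
        item = MyItem()
        # .get() returns "" instead of raising KeyError if the key is missing
        item["firstName"] = jsonresponse.get("firstName", "")
        return item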
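For the "save to csv" part, Scrapy's feed exports can write the returned items for you when the spider is run with an output file, e.g. scrapy crawl myspider -o items.csv (myspider here is a placeholder spider name). If you would rather write the file yourself, a minimal standard-library sketch looks like the following; the column names "name" and "fax" and the helper rows_to_csv are assumptions chosen to match the question.

```python
import csv
import io

# One row per scraped record; values taken from the question's sample JSON.
rows = [{"name": "John Smith", "fax": "646 555-4567"}]


def rows_to_csv(rows, fieldnames=("name", "fax")):
    """Serialise a list of dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, pass a file opened with open("items.csv", "w", newline="") to csv.DictWriter in place of the StringIO buffer.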