Scrapy Unit Testing
Question:
I’d like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since a project is run through the “scrapy crawl” command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I’d like to get nose working.
Update:
I’ve been talking on Scrapy-Users and I guess I am supposed to “build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output”. I can’t seem to get this to work though.
I can build a unit-test test class and in a test:
- create a response object
- try to call the parse method of my spider with the response object
However it ends up generating this traceback. Any insight as to why?
Answers:
You can follow this snippet from the scrapy site to run it from a script. Then you can make any kind of asserts you’d like on the returned items.
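A minimal sketch of that approach, collecting items via the item_scraped signal, could look like the following (the spider import and the final assertion are assumptions on my part, not part of the linked snippet):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # hypothetical spider

collected_items = []

def collect_item(item, response, spider):
    # Gather every scraped item so we can assert on it after the crawl.
    collected_items.append(item)

process = CrawlerProcess(settings={"LOG_ENABLED": False})
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished

assert len(collected_items) > 0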
The way I’ve done it is create fake responses, this way you can test the parse function offline. But you get the real situation by using real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state of the site. If the HTML changes online you may have a serious bug, but your test cases will still pass, so this may not be the best way to test.
My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that is causing it and write a unit test for it.
This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:
# scrapyproject/tests/responses/__init__.py
import os

from scrapy.http import Response, Request


def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from a HTML file

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.

    returns: A scrapy HTTP response which can be used for unittesting.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    file_content = open(file_path, 'r').read()

    response = Response(url=url,
                        request=request,
                        body=file_content)
    response.encoding = 'utf-8'
    return response
The sample html file is located in scrapyproject/tests/responses/osdir/sample.html
Then the testcase could look as follows:
The test case location is scrapyproject/tests/test_osdir.py
import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
            count += 1
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)
That’s basically how I test my parsing methods, but it’s not only for parsing methods. If it gets more complex I suggest looking at Mox.
The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
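A contract lives in the callback’s docstring and is exercised with the scrapy check command; a minimal sketch (the URL and the scraped field names here are placeholders) could look like this:

def parse(self, response):
    """ This callback parses a sample page.

    @url http://www.example.com/some-page.html
    @returns items 1 16
    @returns requests 0 0
    @scrapes title content
    """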
I use Betamax to run the test against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are super fast:
Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.
When you need the latest version of the site, just remove what Betamax has recorded and re-run the test.
Example:
from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))
        with self.assertRaises(StopIteration):
            next(result)
FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco’s talk.
I’m using Twisted’s trial to run tests, similar to Scrapy’s own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.
Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:
from twisted.internet import defer
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output


class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise ex

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider
Example:
@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)
or perform one request in the setup and run multiple tests against the results:
@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)
I’m using Scrapy 1.3.0, and the fake_response_from_file function raises an error here:
response = Response(url=url, request=request, body=file_content)
I get:
raise AttributeError("Response content isn't text")
The solution is to use TextResponse instead, and it works OK, for example:
response = TextResponse(url=url, request=request, body=file_content)
Thanks a lot.
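For completeness, a minimal sketch of the helper rewritten around TextResponse (the same file layout as above is assumed, and the explicit encoding argument is my own assumption):

import os

from scrapy.http import Request, TextResponse


def fake_response_from_file(file_name, url='http://www.example.com'):
    """Create a fake Scrapy TextResponse from a local HTML file."""
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_name = os.path.join(responses_dir, file_name)
    with open(file_name, 'rb') as f:
        body = f.read()
    return TextResponse(url=url,
                        request=Request(url=url),
                        body=body,
                        encoding='utf-8')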
Slightly simpler, by removing the fake_response_from_file helper from the chosen answer:
import unittest

from spiders.my_spider import MySpider
from scrapy.selector import Selector


class TestParsers(unittest.TestCase):

    def setUp(self):
        self.spider = MySpider(limit=1)
        self.html = Selector(text=open("some.htm", 'r').read())

    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)


if __name__ == '__main__':
    unittest.main()
This is a very late answer, but I’ve been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.
It works by defining test specifications rather than static output.
For example if we are crawling this sort of item:
{
    "name": "Alex",
    "age": 21,
    "gender": "Female",
}
We can define a scrapy-test ItemSpec:
from scrapytest.tests import Match, MoreThan, LessThan, Type  # Type added here; it is used below and assumed to live alongside the others
from scrapytest.spec import ItemSpec


class MySpec(ItemSpec):
    name_test = Match('{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')
There are also tests of the same kind for Scrapy stats, via StatsSpec:
from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan


class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }
Afterwards it can be run against live or cached results:
$ scrapy-test
# or
$ scrapy-test --cache
I’ve been running cached runs for development changes and daily cronjobs for detecting website changes.
https://github.com/ThomasAitken/Scrapy-Testmaster
This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction (allowing for easy dynamic updating of test cases and merging the processes of debugging and test-case generation). It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse).
Similar to Hadrien’s answer but for pytest: pytest-vcr.
import requests
import pytest

from scrapy.http import HtmlResponse


@pytest.mark.vcr()
def test_parse(url, target):
    # url and target are supplied by pytest fixtures
    response = requests.get(url)
    scrapy_response = HtmlResponse(url, body=response.content)
    # parse() typically yields items, so materialize them before comparing
    assert list(Spider().parse(scrapy_response)) == target
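The url and target arguments above are expected to come from pytest fixtures; a hypothetical conftest.py (names and values are illustrative only) might look like:

# conftest.py  (hypothetical fixtures backing the test above)
import pytest


@pytest.fixture
def url():
    return "http://www.example.com/some-page.html"


@pytest.fixture
def target():
    # The items we expect Spider().parse() to yield for that page.
    return [{"image_href": "image1.html"}, {"image_href": "image2.html"}]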