
Scrapy Unit Testing


The way I've done it is to create fake responses; that way you can test the parse function offline. But you still get the real situation, because you're using real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online you may have a big bug, but your test cases will still pass, so this may not be the best way to test.

My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that caused it, and write a unit test against it.

This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

# scrapyproject/tests/responses/__init__.py
import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url=None):
    """
    Create a fake Scrapy HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    # Read raw bytes; HtmlResponse decodes them with the given encoding
    # and supports xpath/css selectors, unlike the base Response class.
    with open(file_path, 'rb') as f:
        file_content = f.read()

    return HtmlResponse(url=url,
                        request=request,
                        body=file_content,
                        encoding='utf-8')

The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.
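Once the helper and the fixture are in place, you can sanity-check the fixture from an interactive session (the XPath below is hypothetical; adjust it to whatever your sample.html actually contains):

from responses import fake_response_from_file

response = fake_response_from_file('osdir/sample.html')
print(response.url)  # http://www.example.com
# hypothetical query against the fixture's markup:
print(response.xpath('//title/text()').extract_first())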

The test case is located at scrapyproject/tests/test_osdir.py and could look as follows:

import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but it's not limited to parsing methods. If it gets more complex, I suggest looking at Mox.
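Mox has since been largely superseded by the standard library's unittest.mock (Python 3.3+; available as the external mock package on Python 2), which covers the same ground. A minimal sketch of the idea, mocking out a collaborator so the test stays offline; StorePipeline and its storage backend are invented for illustration and are not part of the project above:

import unittest
from unittest import mock


class StorePipeline(object):
    """Hypothetical pipeline that hands each scraped item to a storage backend."""

    def __init__(self, storage):
        self.storage = storage

    def process_item(self, item, spider):
        self.storage.save(item)
        return item


class StorePipelineTest(unittest.TestCase):

    def test_process_item_saves_and_passes_through(self):
        storage = mock.Mock()              # stands in for the real backend
        pipeline = StorePipeline(storage)

        item = {'title': 'example', 'content': '...'}
        result = pipeline.process_item(item, spider=None)

        storage.save.assert_called_once_with(item)  # backend was called once, correctly
        self.assertIs(result, item)                 # item is passed along unchanged


if __name__ == '__main__':
    unittest.main()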


I use Betamax to run tests against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are super fast:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need to get the latest version of the site, just remove what Betamax has recorded and re-run the test.

Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase

with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # the http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response from it to test the parse method:
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))
        with self.assertRaises(StopIteration):
            next(result)

FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.


The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
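A minimal sketch, following the contracts example in the Scrapy documentation; the URL, field names, and XPath expressions are placeholders:

from scrapy import Spider


class ExampleSpider(Spider):
    name = 'example'

    def parse(self, response):
        """The @-lines below are contracts; `scrapy check example`
        fetches @url once and verifies the other lines against
        whatever this method yields for it.

        @url http://www.example.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes title content
        """
        for entry in response.xpath('//article'):  # placeholder markup
            yield {
                'title': entry.xpath('.//h2/text()').extract_first(),
                'content': entry.xpath('.//p/text()').extract_first(),
            }

Note that, unlike the fake-response and Betamax approaches above, scrapy check hits the network on every run, since the contracts are evaluated against a live download of @url.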