passing selenium response url to scrapy
Use a Downloader Middleware to catch pages that require Selenium before Scrapy processes them through its regular download path:
The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.
Here's a very basic example using PhantomJS:
```python
from scrapy.http import HtmlResponse
from selenium import webdriver


class JSMiddleware(object):

    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get(request.url)

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
```
Once you return that HtmlResponse (or a TextResponse, if that's what you really want), Scrapy stops running the remaining downloader machinery and hands the response to the spider's parse method:
If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response. The process_response() methods of installed middleware are always called on every response.
In this case, you can continue to use your spider's parse method as you normally would with HTML, except that the JS on the page has already been executed.
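To illustrate that idea with a runnable sketch: the body Scrapy hands to parse is the rendered page_source, so ordinary extraction just works. The stdlib html.parser stands in for response.css/xpath here purely so the snippet runs without Scrapy installed, and the page string is made up:

```python
from html.parser import HTMLParser

# In a real spider you would simply write:
#     def parse(self, response):
#         yield {'title': response.css('title::text').get()}
# Below, the same idea on a rendered page_source string, dependency-free.

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = data

# Hypothetical rendered output -- by the time parse() sees it,
# the JS has already run, so the title is final.
page_source = '<html><head><title>Rendered by JS</title></head><body></body></html>'
parser = TitleParser()
parser.feed(page_source)
print(parser.title)  # -> Rendered by JS
```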
Tip: since the Downloader Middleware's process_request method accepts the spider as an argument, you can add a conditional in the spider to check whether you need to process JS at all, and that will let you handle both JS and non-JS pages with the exact same spider class.
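One way to sketch that conditional (the `use_js` spider attribute and `meta['js']` key are made-up names for this example, not Scrapy APIs):

```python
# Hypothetical helper: decide whether a request needs the Selenium path.
# A per-request meta flag overrides the spider-wide default.
def needs_js(request_meta, spider):
    return bool(request_meta.get('js', getattr(spider, 'use_js', False)))

# In the middleware, returning None tells Scrapy to download normally:
#
#     def process_request(self, request, spider):
#         if not needs_js(request.meta, spider):
#             return None          # fall through to the regular downloader
#         ...                      # otherwise take the Selenium path

class NewsSpider(object):          # stand-in spider with a spider-wide flag
    use_js = True

spider = NewsSpider()
print(needs_js({}, spider))             # True  (spider default)
print(needs_js({'js': False}, spider))  # False (per-request override)
```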
Here is a middleware for Scrapy and Selenium
```python
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver


class SeleniumMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        request.meta['driver'] = self.driver  # to access driver from response
        self.driver.get(request.url)
        body = to_bytes(self.driver.page_source)  # body must be of type bytes
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        self.driver = webdriver.Firefox()

    def spider_closed(self, spider):
        self.driver.close()
```
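On the spider side, the driver stored in request.meta comes back via response.meta, which is Scrapy's shortcut for response.request.meta (it works here because the HtmlResponse was built with request=request). A sketch, with a stand-in response object so it runs without Scrapy or Selenium installed:

```python
class FakeResponse(object):
    """Stand-in for scrapy.http.Response exposing only .meta."""
    def __init__(self, meta):
        self.meta = meta

def parse(response):
    # In a real spider this would be MySpider.parse(self, response).
    driver = response.meta['driver']   # the live Selenium driver
    # You could now interact with the rendered page before extraction, e.g.:
    # driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    return driver

print(parse(FakeResponse({'driver': 'selenium-driver-object'})))
```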
You also need to add this to settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'youproject.middlewares.selenium.SeleniumMiddleware': 200
}
```
Decide whether it should be 200 or something else based on the docs.
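For orientation: in Scrapy's built-in DOWNLOADER_MIDDLEWARES_BASE, lower orders sit closer to the engine, so their process_request() runs earlier. And because this middleware returns a response from process_request(), every middleware ordered after it is skipped for that request. A commented settings.py fragment:

```python
# settings.py -- DOWNLOADER_MIDDLEWARES fragment.
# Lower numbers are closer to the engine: their process_request() runs first.
# Since SeleniumMiddleware returns a response from process_request(),
# built-ins ordered after 200 never see these requests
# (e.g. RetryMiddleware at 550 would not retry them).
DOWNLOADER_MIDDLEWARES = {
    'youproject.middlewares.selenium.SeleniumMiddleware': 200,
}
```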
Update: Firefox headless mode with Scrapy and Selenium
If you want to run Firefox in headless mode, then install xvfb
```shell
sudo apt-get install -y xvfb
```
and PyVirtualDisplay
```shell
sudo pip install pyvirtualdisplay
```
and use this middleware
```python
from shutil import which

from pyvirtualdisplay import Display
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.project import get_project_settings
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

settings = get_project_settings()
HEADLESS = True


class SeleniumMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        request.meta['driver'] = self.driver
        body = str.encode(self.driver.page_source)
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_opened(self, spider):
        if HEADLESS:
            self.display = Display(visible=0, size=(1280, 1024))
            self.display.start()
        binary = FirefoxBinary(settings.get('FIREFOX_EXE') or which('firefox'))
        self.driver = webdriver.Firefox(firefox_binary=binary)

    def spider_closed(self, spider):
        self.driver.close()
        if HEADLESS:
            self.display.stop()
```
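Note that newer Firefox versions (55+) support native headless mode, which removes the need for xvfb and pyvirtualdisplay entirely. An alternative spider_opened fragment, assuming a recent Firefox and geckodriver:

```python
# Alternative spider_opened for Firefox 55+ with native headless support;
# no virtual display is needed, so the Display/xvfb code can be dropped.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def spider_opened(self, spider):
    options = Options()
    options.add_argument('-headless')   # Firefox's native headless flag
    self.driver = webdriver.Firefox(options=options)
```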
where settings.py contains

```python
FIREFOX_EXE = '/path/to/firefox.exe'
```
The problem is that some versions of Firefox don't work with Selenium. To solve this, you can download Firefox version 47.0.1 (this version works flawlessly) from here, then extract it and put it inside your selenium project. Afterwards, modify the Firefox path as
```python
FIREFOX_EXE = '/path/to/your/scrapyproject/firefox/firefox.exe'
```