selenium with scrapy for dynamic page

python selenium selenium-webdriver web-scraping scrapy

It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy and Selenium:

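    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self, *args, **kwargs):
            # Let Scrapy initialise the spider before attaching the browser
            super().__init__(*args, **kwargs)
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)

            while True:
                try:
                    # Find and click the "next page" link; stop when that fails
                    next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                    next_link.click()

                    # get the data and write it to scrapy items
                except WebDriverException:
                    break

            self.driver.close()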

Here are some examples of "selenium spiders":

There is also an alternative to using Selenium with Scrapy. In some cases, the ScrapyJS middleware (since renamed scrapy-splash) is enough to handle the dynamic parts of a page. Sample real-world usage:

Scraping dynamic content using python-Scrapy

If the URL doesn't change between the two pages, add dont_filter=True to your scrapy.Request(). Otherwise Scrapy will treat the URL as a duplicate after processing the first page and drop the request.

If you need to render pages with JavaScript, use scrapy-splash. You can also check this Scrapy middleware, which handles JavaScript pages using Selenium, or launch a headless browser yourself.

A faster and more effective solution, however, is to inspect your browser's network traffic and see what requests are made when submitting a form or triggering a certain event. Try to replicate the requests your browser sends. If you replicate them correctly, you will get the data you need.

Here is an example:
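For the scrapy-splash route, the project settings usually need a few entries. The fragment below is a sketch of a settings.py, with values as I recall them from the scrapy-splash README, so check them against the current README. SPLASH_URL assumes a Splash instance running locally on port 8050.

```python
# settings.py fragment (sketch; verify against the scrapy-splash README)

# Assumption: a Splash instance is listening locally on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes the duplicate filter aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these in place, a spider would yield `SplashRequest(url, callback, args={"wait": 0.5})` (imported from `scrapy_splash`) instead of `scrapy.Request` for the pages that need rendering.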

    import json

    from scrapy import Request, Spider

    # QuoteItem is assumed to be defined in your project's items module
    # with "author", "quote" and "tags" fields.


    class ScrollScraper(Spider):
        name = "scrollingscraper"
        quote_url = "http://quotes.toscrape.com/api/quotes?page="
        start_urls = [quote_url + "1"]

        def parse(self, response):
            print(response.body)
            data = json.loads(response.body)
            for item in data.get('quotes', []):
                # Create a fresh item for every quote
                quote_item = QuoteItem()
                quote_item["author"] = item.get('author', {}).get('name')
                quote_item['quote'] = item.get('text')
                quote_item['tags'] = item.get('tags')
                yield quote_item

            # Follow the API's own pagination until it reports no more pages
            if data['has_next']:
                next_page = data['page'] + 1
                yield Request(self.quote_url + str(next_page))
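The parsing and pagination logic of such a spider can be tried out without running a crawl. The sketch below uses only the standard library; `parse_quotes_page` is a hypothetical helper and the sample payload is made up, mirroring only the fields the spider above reads (quotes, author.name, text, tags, has_next, page).

```python
import json


def parse_quotes_page(body):
    """Return (quotes, next_page); next_page is None on the last page."""
    data = json.loads(body)
    quotes = [
        {
            "author": item.get("author", {}).get("name"),
            "quote": item.get("text"),
            "tags": item.get("tags"),
        }
        for item in data.get("quotes", [])
    ]
    next_page = data["page"] + 1 if data.get("has_next") else None
    return quotes, next_page


# Made-up payload with the same shape as the fields used above.
sample_body = json.dumps({
    "page": 1,
    "has_next": True,
    "quotes": [
        {"author": {"name": "Jane Doe"}, "text": "Sample quote.", "tags": ["demo"]},
    ],
})

quotes, next_page = parse_quotes_page(sample_body)
print(quotes, next_page)
```

Copying a real response body from the browser's network tab into `sample_body` is a quick way to confirm the field names before writing the spider.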

When the pagination URL is the same for every page and the site uses a POST request, you can use scrapy.FormRequest() instead of scrapy.Request(). The two behave the same, except that FormRequest adds a formdata= argument to the constructor.

Here is another spider example from this post:

    import json

    import scrapy
    from scrapy.http import FormRequest
    from scrapy.selector import Selector


    class SpiderClass(scrapy.Spider):
        # spider name and all
        name = 'ajax'
        page_incr = 1
        start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
        pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

        def parse(self, response):
            sel = Selector(response)

            # From the second page on, the response is JSON wrapping an HTML fragment
            if self.page_incr > 1:
                json_data = json.loads(response.body)
                sel = Selector(text=json_data.get('content', ''))

            # your code here

            # pagination code starts here
            if sel.xpath('//div[@class="panel-wrapper"]'):
                self.page_incr += 1
                formdata = {
                    'sorter': 'recent',
                    'location': 'main loop',
                    'loop': 'main loop',
                    'action': 'sort',
                    'view': 'grid',
                    'columns': '3',
                    'paginated': str(self.page_incr),
                    'currentquery[category_name]': 'reviews'
                }
                yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
            else:
                return
