How to run a spider multiple times with different input

python selenium web-scraping scrapy web-crawler

When you call process.start(), Scrapy's CrawlerProcess starts a Twisted reactor. By default the reactor stops when the crawlers finish, and it cannot be restarted afterwards. One possible solution is to call start() with the stop_after_crawl parameter set to False:

 process.start(stop_after_crawl=False)

This prevents the reactor from stopping, which sidesteps the restart problem. I can't say it won't lead to other problems later, so you should test it to be sure.

The documentation also has examples of running multiple spiders in the same process, one of which explicitly runs and stops the reactor, but it uses CrawlerRunner instead of CrawlerProcess.

Finally, if the solutions above don't help, I would suggest trying this:

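    from scrapy.crawler import CrawlerProcess

    # tmallSpider, jdSpider and inputlist come from the asker's own script.
    if __name__ == '__main__':
        # Create the process once, before the loop.
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {
                    "format": "csv",
                    "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall'],
                },
                "itemjd.csv": {
                    "format": "csv",
                    "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd'],
                },
            },
        })
        # Schedule every crawl on the same process.
        for a in range(len(inputlist)):
            process.crawl(tmallSpider)
            process.crawl(jdSpider)
        # Start the reactor once, after all crawls are scheduled.
        process.start()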

The point here is that the process is started only once, outside the loop, and the CrawlerProcess is also instantiated outside the loop. Otherwise every iteration would overwrite the previous CrawlerProcess instance.
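Since the goal is to run the spiders with a different input each time, one option is to pass the input as a keyword argument to process.crawl(), which Scrapy forwards to the spider's constructor and stores as an attribute. The following is only a minimal sketch: the `keyword` argument is a hypothetical name the spiders would have to read as `self.keyword`, and `schedule_crawls` and `run_all` are illustrative helpers, not part of Scrapy.

```python
def schedule_crawls(process, spider_classes, inputs):
    """Schedule one crawl per (input, spider class) pair on a single process."""
    scheduled = []
    for value in inputs:
        for spider_cls in spider_classes:
            # Keyword arguments are forwarded to the spider's __init__ and
            # become attributes, so the spider can read self.keyword
            # ("keyword" is an assumed name, pick whatever your spider expects).
            process.crawl(spider_cls, keyword=value)
            scheduled.append((spider_cls.__name__, value))
    return scheduled


def run_all(spider_classes, inputs, settings=None):
    """Create one CrawlerProcess, schedule every crawl, then start it once."""
    # Imported lazily so schedule_crawls() can be exercised without Scrapy.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings)
    schedule_crawls(process, spider_classes, inputs)
    # Starts the reactor once and blocks until every scheduled crawl is done.
    process.start()
```

With the spiders from the question this would be called as run_all([tmallSpider, jdSpider], inputlist, settings={...}). Note that all scheduled crawls run concurrently in the same reactor, so this may not fit if the runs must happen strictly one after another.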


The process should be started after all of the spiders are set up, as shown here:

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

In your case, a bit more code would have helped, but I suppose the fix is to set up all the crawls for both spiders and for all of the products first, and only then call start():

    from scrapy.crawler import CrawlerProcess

    if __name__ == '__main__':
        # One process for all crawls; creating it inside the loop would
        # leave only the last instance to be started.
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {
                    "format": "csv",
                    "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall'],
                },
                "itemjd.csv": {
                    "format": "csv",
                    "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd'],
                },
            },
        })
        for a in range(len(inputlist)):
            process.crawl(tmallSpider)
            process.crawl(jdSpider)
        process.start()


With a spider class this should be easy. There are a few things to note when using scrapy.Request:

callback is a method that handles response

dont_filter=True allows you to request same pages multiple times

errback is a method that handles responses with errors

you can yield Request anytime and it will be added to the pool

    import scrapy


    class GSMArenaSpider(scrapy.Spider):
        name = "smartmania"
        # You can put as many starting links as you want.
        main_url = ['https://smartmania.cz/zarizeni/telefony/']

        def start_requests(self):
            for url in GSMArenaSpider.main_url:
                self.logger.debug(f"Starting Scrapy @ {url}")
                # You can bind any parsing method you need. parse_ipads,
                # parse_iphones and parse_phone_links are assumed to be
                # defined on the spider. dont_filter=True keeps the duplicate
                # filter from dropping the repeated requests to the same URL.
                yield scrapy.Request(url=url, callback=self.parse_pages, errback=self.errback_httpbin)
                yield scrapy.Request(url=url, callback=self.parse_ipads, errback=self.errback_httpbin,
                                     dont_filter=True)
                yield scrapy.Request(url=url, callback=self.parse_iphones, errback=self.errback_httpbin,
                                     dont_filter=True)

        def parse_pages(self, response):
            # Parse the response here; `result` stands for the URLs you extract.
            result = []
            for url in result:
                self.logger.info(f"Found pages: {url}")
                yield scrapy.Request(url=url, callback=self.parse_phone_links, errback=self.errback_httpbin,
                                     dont_filter=True)
                # Be careful with recursive requests when the filter is disabled.
                yield scrapy.Request(url=url, callback=self.parse_pages, errback=self.errback_httpbin,
                                     dont_filter=False)

        def errback_httpbin(self, failure):
            """Handle failed requests."""
            url = failure.request.url
            callback = failure.request.callback
            errback = failure.request.errback
            # Only HTTP errors carry a response; DNS or timeout failures do not.
            response = getattr(failure.value, "response", None)
            status = response.status if response is not None else None
            self.logger.error(f"Fail status: {status} @: {url}")
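An errback receives a Twisted Failure rather than a response. Only failures raised for error status codes carry a response object; DNS or timeout failures do not. Below is a minimal sketch of a helper that reads the details defensively, kept outside the spider so it can be checked with plain objects. `summarize_failure` is a hypothetical name, not a Scrapy function.

```python
def summarize_failure(failure):
    """Return the URL, HTTP status (if any) and error type of a failed request."""
    # Scrapy attaches the originating request to the failure passed to an errback.
    request = failure.request
    # HTTP errors expose .response; other errors have no response at all.
    response = getattr(failure.value, "response", None)
    return {
        "url": request.url,
        "status": response.status if response is not None else None,
        "error": type(failure.value).__name__,
    }
```

Inside an errback it could be used as self.logger.error("Request failed: %s", summarize_failure(failure)).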
