
How to filter duplicate requests based on URL in Scrapy


You can write a custom filter for duplicate removal and add it in settings:

import os
from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

Then you need to set the correct DUPEFILTER_CLASS in settings.py:

DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'

It should work after that
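
Side note: on newer Scrapy releases (1.0+, if I recall correctly) the dupefilter module was renamed, so only the import changes; the class and its request_seen hook stay the same:

# Newer Scrapy moved the module to scrapy.dupefilters (plural);
# everything else in the filter above is unchanged.
from scrapy.dupefilters import RFPDupeFilter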


Following ytomar's lead, I wrote this filter, which checks purely against URLs that have already been seen, using an in-memory set. I'm a Python noob, so let me know if I screwed something up, but it seems to work all right:

from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)

As ytomar mentioned, be sure to add the DUPEFILTER_CLASS constant to settings.py:

DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
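
If you want to sanity-check the filter outside of a full crawl, a minimal sketch like this should do (the URL is just a placeholder):

# Quick check of SeenURLFilter (defined above) in isolation; the URL is made up.
from scrapy.http import Request

f = SeenURLFilter()
print(f.request_seen(Request('http://example.com/page')))  # falsy -> first time seen
print(f.request_seen(Request('http://example.com/page')))  # True  -> duplicate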


https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py

This file might help you. It builds a database of unique deltafetch keys from the URLs that you pass in via scrapy.Request(meta={'deltafetch_key': unique_url_key}). This lets you avoid duplicate requests for pages you have already visited in the past.
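
To wire it up, the middleware is normally enabled in settings.py along these lines (the priority value below is just illustrative):

# Illustrative settings.py snippet for enabling the scrapylib DeltaFetch
# spider middleware; the 100 priority is an arbitrary example value.
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True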

A sample MongoDB implementation using deltafetch.py:

# Inside DeltaFetch.process_spider_output(self, response, result, spider),
# iterating over each r in result:
if isinstance(r, Request):
    key = self._get_key(r)
    key = key + spider.name
    if self.db['your_collection_to_store_deltafetch_key'].find_one({"_id": key}):
        spider.log("Ignoring already visited: %s" % r, level=log.INFO)
        continue
elif isinstance(r, BaseItem):
    key = self._get_key(response.request)
    key = key + spider.name
    try:
        self.db['your_collection_to_store_deltafetch_key'].insert({"_id": key, "time": datetime.now()})
    except:
        spider.log("Ignoring already visited: %s" % key, level=log.ERROR)
yield r

E.g., with id = 345:

scrapy.Request(url, meta={'deltafetch_key': 345}, callback=parse)
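
Putting it together, here is a hedged sketch of a spider that tags each request with a deltafetch_key. The spider name, URL, CSS selector, and the assumption that the record id is the last path segment are all illustrative:

import scrapy

class ItemsSpider(scrapy.Spider):
    # Illustrative spider: name, URLs and selectors below are made up.
    name = 'items'
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        for href in response.css('a.item::attr(href)').extract():
            item_id = href.rstrip('/').split('/')[-1]  # assume the id is the last path segment
            yield scrapy.Request(
                response.urljoin(href),
                meta={'deltafetch_key': item_id},  # stable key used for dedup
                callback=self.parse_item,
            )

    def parse_item(self, response):
        yield {'url': response.url}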