
Python selenium multiprocessing


How can I reduce the execution time of a Selenium scraping script when it is run with multiprocessing?

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver

def get_title(url):
  driver = get_driver()
  driver.get(url)
  (...)

(...)

On my system this reduces the time from 1m7s to just 24.895s, i.e. to roughly a third of the original. To test yourself, download the full script.
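The parts skipped above presumably include the imports and the code that maps get_title over the URL list with a ThreadPool. A minimal sketch of that driving code - the pool size and URL list are placeholders of mine, not taken from the original script - could look like:

from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    # hypothetical URL list -- substitute the pages you actually scrape
    urls = ['https://stackoverflow.com/questions/tagged/web-scraping?page=%d' % n
            for n in range(1, 9)]
    with ThreadPool(8) as pool:       # 8 worker threads, each reusing its own driver
        titles = pool.map(get_title, urls)
    print(titles)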

Note: ThreadPool uses threads, which are constrained by the Python GIL. That's fine as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead, which launches parallel processes that, as a group, are not constrained by the GIL. The rest of the code stays the same, as sketched below.
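For example, the swap could look roughly like this (again a sketch with placeholder values, reusing the same get_title as above):

from multiprocessing import Pool      # process pool instead of multiprocessing.pool.ThreadPool

if __name__ == '__main__':
    # same hypothetical URL list as in the thread-based sketch
    urls = ['https://stackoverflow.com/questions/tagged/web-scraping?page=%d' % n
            for n in range(1, 9)]
    with Pool(4) as pool:             # 4 worker processes; the GIL no longer serializes the work
        titles = pool.map(get_title, urls)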


The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers, thus leaving the possibility of driver processes hanging around. I would make the following changes:

  1. Instead, use a class Driver that creates the driver instance and stores it in thread-local storage, but also has a destructor that will quit the driver when the thread-local storage is deleted:
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')
  2. create_driver now becomes:
threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
  3. Finally, after you have no further use for the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully); a sketch putting all three pieces together follows this list:
del threadLocal
import gc
gc.collect()  # a little extra insurance
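Putting it together, the overall flow might look like the sketch below. The get_title helper, the URL list, and the pool size are placeholders of mine, not taken from the steps above; only Driver, create_driver and the teardown lines come from the list just given.

from multiprocessing.pool import ThreadPool
import gc

def get_title(url):
    driver = create_driver()          # per-thread Driver instance from step 2
    driver.get(url)
    return driver.title

if __name__ == '__main__':
    # hypothetical URL list -- substitute the pages you actually scrape
    urls = ['https://stackoverflow.com/questions/tagged/web-scraping?page=%d' % n
            for n in range(1, 9)]
    pool = ThreadPool(4)
    titles = pool.map(get_title, urls)
    # scraping is done: drop the thread-local storage so the Driver destructors
    # (and hence driver.quit()) can run, then shut the pool down
    del threadLocal
    gc.collect()                      # a little extra insurance
    pool.close()
    pool.join()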


My question: how can I reduce the execution time?

Selenium seems like the wrong tool for web scraping - though I appreciate that YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.

For scraping tasks without much interaction, I have had good results with the open-source Scrapy Python package for large-scale scraping tasks. It handles concurrent requests out of the box, it is easy to write new spiders and store the data in files or a database -- and it is really fast.

Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this, see documentation on selectors).

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # yield the text and link of every question summary on the page
        for title in response.css('.summary .question-hyperlink'):
            yield {
                'title': title.css('::text').get(),
                'link': title.attrib['href'],
            }

To run it, put this into blogspider.py and run

$ scrapy runspider blogspider.py
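If you want the yielded items written to a file instead of just echoed in the log, runspider also takes an output file; the filename here is only an example:

$ scrapy runspider blogspider.py -o titles.json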

See the Scrapy website for a complete tutorial.

Note that Scrapy also supports JavaScript through scrapy-splash (thanks to @SIM for the pointer). I haven't used it myself so far, so I can't speak to it beyond noting that it looks well integrated with how Scrapy works.
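For completeness, the basic usage shown in the scrapy-splash README looks roughly like the untested sketch below; it assumes a Splash server is running and the scrapy-splash downloader middlewares are enabled in settings.py.

import scrapy
from scrapy_splash import SplashRequest

class JsBlogSpider(scrapy.Spider):
    name = 'jsblogspider'

    def start_requests(self):
        # ask Splash to render the page (executing its JavaScript) before parsing
        yield SplashRequest('https://stackoverflow.com/questions/tagged/web-scraping',
                            self.parse, args={'wait': 1})

    def parse(self, response):
        for title in response.css('.summary .question-hyperlink'):
            yield {'title': title.css('::text').get(),
                   'link': title.attrib['href']}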