Python selenium multiprocessing

python python-3.x selenium web-scraping multiprocessing

How can I reduce the execution time of a Selenium script that already runs under multiprocessing?

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

    (... skipped for brevity ...)

    threadLocal = threading.local()

    def get_driver():
        # Reuse the driver already stored for this thread, if there is one
        driver = getattr(threadLocal, 'driver', None)
        if driver is None:
            chromeOptions = webdriver.ChromeOptions()
            chromeOptions.add_argument("--headless")
            # Selenium 4 takes options=; the older chrome_options= keyword was removed
            driver = webdriver.Chrome(options=chromeOptions)
            setattr(threadLocal, 'driver', driver)
        return driver

    def get_title(url):
        driver = get_driver()
        driver.get(url)
        (...)

    (...)

On my system this reduces the time from 1m7s to just 24.895s, that is, to roughly 37% of the original run time (a ~63% reduction). To test yourself, download the full script.
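The once-per-thread pattern can be tried without a browser. The following is a minimal sketch, assuming a hypothetical `FakeDriver` class that stands in for `webdriver.Chrome` so the example runs without Selenium installed. `FakeDriver`, `scrape_titles`, and the `workers` parameter are illustrative names, not part of the original script.

```python
import threading
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; counts how often it is launched."""

    launches = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeDriver._lock:
            FakeDriver.launches += 1

    def get(self, url):
        # A real driver would load the page; here we only derive a fake title.
        self.title = "title of " + url


thread_local = threading.local()


def get_driver():
    # Reuse the driver already attached to this thread, if any.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()  # with Selenium: webdriver.Chrome(options=...)
        thread_local.driver = driver
    return driver


def get_title(url):
    driver = get_driver()
    driver.get(url)
    return driver.title


def scrape_titles(urls, workers=4):
    # Leaving the with-block terminates the pool's worker threads.
    with ThreadPool(workers) as pool:
        return pool.map(get_title, urls)
```

However many URLs are passed to `scrape_titles`, `FakeDriver.launches` never exceeds the number of worker threads, which is where the time saving comes from.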

Note: ThreadPool uses threads, which are constrained by the Python GIL. That is fine as long as the task is mostly I/O bound, as waiting for pages to load is. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes, which as a group are not constrained by the GIL. The rest of the code stays the same.


The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers and thus leaves open the possibility of browser processes hanging around. I would make the following changes:

Instead, use a class Driver that creates the driver instance and stores it in thread-local storage, but that also has a destructor that quits the driver when the thread-local storage is deleted:

    class Driver:
        def __init__(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            self.driver = webdriver.Chrome(options=options)

        def __del__(self):
            self.driver.quit() # clean up driver when we are cleaned up
            #print('The driver has been "quitted".')

create_driver now becomes:

    threadLocal = threading.local()

    def create_driver():
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = Driver()
            setattr(threadLocal, 'the_driver', the_driver)
        return the_driver.driver

Finally, after you have no further use for the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully):

    del threadLocal
    import gc
    gc.collect() # a little extra insurance
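If relying on destructors and garbage collection feels too uncertain, one alternative (not what this answer proposes) is to record every driver in a shared list and quit them explicitly once the pool has finished. This is a minimal sketch under stated assumptions: `FakeDriver` is a hypothetical stand-in for `webdriver.Chrome` so the code runs without a browser, and `all_drivers`, `quit_all_drivers`, `visit`, and `run` are illustrative names.

```python
import threading
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome that records whether quit() ran."""

    def __init__(self):
        self.quit_called = False

    def get(self, url):
        self.current_url = url

    def quit(self):
        self.quit_called = True


thread_local = threading.local()
all_drivers = []                      # every driver created by any worker thread
all_drivers_lock = threading.Lock()   # protects all_drivers across threads


def create_driver():
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()  # with Selenium: webdriver.Chrome(options=...)
        thread_local.driver = driver
        with all_drivers_lock:
            all_drivers.append(driver)
    return driver


def quit_all_drivers():
    """Quit every registered driver and return the drivers that were closed."""
    with all_drivers_lock:
        drivers = list(all_drivers)
        all_drivers.clear()
    for driver in drivers:
        driver.quit()
    return drivers


def visit(url):
    driver = create_driver()
    driver.get(url)
    return driver.current_url


def run(urls, workers=3):
    pool = ThreadPool(workers)
    try:
        results = pool.map(visit, urls)
    finally:
        pool.close()
        pool.join()                   # wait until the worker threads are done
        closed = quit_all_drivers()   # explicit cleanup, no reliance on __del__
    return results, closed
```

The trade-off is a little extra bookkeeping in exchange for cleanup that does not depend on when the interpreter decides to run destructors.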


My question: how can I reduce the execution time?

Selenium seems the wrong tool for web scraping - though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.

For scraping tasks without much interaction, I have had good results using the open-source Scrapy Python package for large-scale scraping tasks. It issues requests concurrently out of the box (asynchronously, on top of Twisted, rather than through multiprocessing), it is easy to write new scripts and store the data in files or a database, and it is really fast.

Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this, see documentation on selectors).

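    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

        def parse(self, response):
            for title in response.css('.summary .question-hyperlink'):
                # Scrapy expects dicts, Items or Requests from parse(), not bare strings;
                # .attrib gives access to the matched element's attributes
                yield {'href': title.attrib['href']}

To run it, save this as blogspider.py and run: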

$ scrapy runspider blogspider.py

See the Scrapy website for a complete tutorial.

Note that Scrapy can also handle JavaScript through scrapy-splash (thanks to @SIM for the pointer). I have not used it myself, so I can only say that it looks well integrated with how Scrapy works.
