Python selenium multiprocessing
How can I reduce the execution time of a Selenium script that is run using multiprocessing?
A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:
    (... skipped for brevity ...)
    threadLocal = threading.local()

    def get_driver():
        # Reuse this thread's driver if one has already been launched
        driver = getattr(threadLocal, 'driver', None)
        if driver is None:
            chromeOptions = webdriver.ChromeOptions()
            chromeOptions.add_argument("--headless")
            # Recent Selenium releases take options=; chrome_options= is the older, deprecated spelling
            driver = webdriver.Chrome(options=chromeOptions)
            setattr(threadLocal, 'driver', driver)
        return driver

    def get_title(url):
        driver = get_driver()
        driver.get(url)
        (...)
    (...)
On my system this reduces the time from 1m7s to just 24.895s, which is roughly 37% of the original run time (a reduction of about 63%). To test yourself, download the full script.
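To see the reuse pattern in isolation, here is a minimal sketch that runs without a browser. `FakeDriver` is a hypothetical stand-in for `webdriver.Chrome`, and the URLs are placeholders; only the thread-local caching mirrors the snippet above.

```python
import threading
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption, not Selenium)."""
    created = 0                      # number of simulated browser launches
    _lock = threading.Lock()

    def __init__(self):
        with FakeDriver._lock:
            FakeDriver.created += 1
        self.title = None

    def get(self, url):
        # A real driver would load the page; here we only derive a fake title.
        self.title = "Title of " + url


thread_local = threading.local()


def get_driver():
    # Return this thread's driver, creating it on first use only.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
    return driver


def get_title(url):
    driver = get_driver()
    driver.get(url)
    return driver.title


urls = ["https://example.com/page/%d" % i for i in range(20)]  # placeholder URLs
with ThreadPool(4) as pool:
    titles = pool.map(get_title, urls)

print(len(titles), "pages handled by", FakeDriver.created, "driver(s)")
```

With four worker threads, at most four drivers are created no matter how many URLs are processed, which is where the saving comes from.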
Note: ThreadPool uses threads, which are constrained by the Python GIL. That's OK if for the most part the task is I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes, which as a group are not constrained by the GIL. The rest of the code stays the same.
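As a rough illustration of how small that swap is, the sketch below picks the pool class with a flag. `word_count`, `make_pool`, and `process_all` are hypothetical names, and the word-counting step merely stands in for whatever post-processing you actually do.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def word_count(text):
    # Hypothetical post-processing step applied to each scraped page.
    return len(text.split())


def make_pool(use_processes, workers):
    # Both classes expose the same map()/close()/join() interface.
    return Pool(workers) if use_processes else ThreadPool(workers)


def process_all(texts, use_processes=False, workers=4):
    with make_pool(use_processes, workers) as pool:
        return pool.map(word_count, texts)


if __name__ == "__main__":
    pages = ["some scraped text", "another page body with more words"]
    # Threads are usually enough while the work is mostly waiting on I/O.
    print(process_all(pages, use_processes=False))
```

Passing `use_processes=True` switches to worker processes. In that case the worker function must be importable at module level, and the call should stay under the `if __name__ == "__main__":` guard on platforms that spawn rather than fork.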
The one potential problem I see with the clever one-driver-per-thread answer is that it omits any mechanism for "quitting" the drivers, which leaves open the possibility of browser processes hanging around. I would make the following changes:
- Instead, use a class Driver that creates the driver instance and stores it in the thread-local storage, but that also has a destructor that will quit the driver when the thread-local storage is deleted:
    class Driver:
        def __init__(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            self.driver = webdriver.Chrome(options=options)

        def __del__(self):
            self.driver.quit() # clean up driver when we are cleaned up
            #print('The driver has been "quitted".')
create_driver now becomes:
    threadLocal = threading.local()

    def create_driver():
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = Driver()
            setattr(threadLocal, 'the_driver', the_driver)
        return the_driver.driver
- Finally, after you have no further use for the ThreadPool instance but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances' destructors to be called (hopefully):
    del threadLocal
    import gc
    gc.collect() # a little extra insurance
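Put together, the pieces above can be exercised without a browser. In this sketch `FakeBrowser` is a hypothetical stand-in for `webdriver.Chrome`, and the `launched` and `quit_calls` lists exist only to make the cleanup observable. It assumes CPython, where an object's destructor runs as soon as its last reference disappears.

```python
import gc
import threading
from multiprocessing.pool import ThreadPool

launched = []    # one entry per simulated browser start
quit_calls = []  # one entry per simulated driver.quit()


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome (assumption, not Selenium)."""

    def __init__(self):
        launched.append(1)
        self.title = None

    def get(self, url):
        self.title = "Title of " + url

    def quit(self):
        quit_calls.append(1)


class Driver:
    def __init__(self):
        self.driver = FakeBrowser()

    def __del__(self):
        self.driver.quit()  # clean up the browser when the wrapper is destroyed


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


def get_title(url):
    driver = create_driver()
    driver.get(url)
    return driver.title


pool = ThreadPool(3)
titles = pool.map(get_title, ["https://example.com/%d" % i for i in range(9)])
pool.close()
pool.join()       # wait for the worker threads to finish

del threadLocal   # drop the thread-local storage and the Driver objects in it
gc.collect()      # a little extra insurance

print(len(launched), "driver(s) launched,", len(quit_calls), "quit")
```

Because `__del__` timing is an implementation detail, treat this as a demonstration of the idea rather than a guarantee on every interpreter.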
My question: how can I reduce the execution time?
Selenium seems the wrong tool for web scraping - though I appreciate YMMV, in particular if you need to simulate user interaction with the web site or there is some JavaScript limitation/requirement.
For scraping tasks without much interaction, I have had good results using the open-source Scrapy Python package for large-scale scraping tasks. It handles many requests concurrently out of the box (through asynchronous networking rather than multiple processes), it is easy to write new scripts and store the data in files or a database, and it is really fast.
Your script would look something like this when implemented as a fully parallel Scrapy spider (note I did not test this, see documentation on selectors).
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

        def parse(self, response):
            for title in response.css('.summary .question-hyperlink'):
                # Spiders must yield items (e.g. dicts) or requests, not bare strings
                yield {'href': title.attrib.get('href')}
To run, put this into blogspider.py and run:

    $ scrapy runspider blogspider.py
See the Scrapy website for a complete tutorial.
Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I have not used it so far, so I can't say more than that it looks well integrated with how Scrapy works.