
Multithreading in Python/BeautifulSoup scraping doesn't speed up at all


You're not parallelizing this properly. What you actually want is for the work done inside your for loop to happen concurrently across many workers. Right now you're moving all of the work into a single background thread, which still does everything synchronously. That's not going to improve performance at all (in fact, it will hurt it slightly).

Here's an example that uses a ThreadPool to parallelize the network operations and the parsing. It's not safe to write to the csv file from many threads at once, so instead each worker returns the data it would have written back to the parent, and the parent writes all the results to the file at the end.

import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)
    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
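One practical addition (not part of the snippet above, just a suggestion): with pool.map, a single URL that times out or returns a page without the expected markup will raise inside a worker and abort the whole run. A minimal defensive wrapper, assuming you're happy to simply skip bad rows, might look like this (crawlToCSV_safe is a hypothetical helper name):

import urllib2

def crawlToCSV_safe(URLrecord):
    # Wrap the real worker so one bad URL doesn't abort the whole pool.map call.
    try:
        return crawlToCSV(URLrecord.strip())  # strip the trailing newline from the CSV line before fetching
    except (urllib2.URLError, AttributeError) as e:
        # URLError covers network failures; AttributeError covers pages that are
        # missing the expected <tbody>/<td> structure (find() returned None).
        print "Skipping %r: %s" % (URLrecord.strip(), e)
        return []

You'd then pass crawlToCSV_safe to pool.map instead of crawlToCSV.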

Note that in Python, threads only really speed up I/O operations. Because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't run in parallel across threads; only one thread can execute Python bytecode at a time. So you still may not see the speed-up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads: just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
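To make that concrete, the only difference between the two versions is which module Pool comes from (a sketch; everything else in the script above stays the same):

# Thread-based Pool: good for I/O-bound work such as waiting on urlopen.
from multiprocessing.dummy import Pool

# Process-based Pool: drop-in replacement with the same map/imap API, but it
# sidesteps the GIL so the BeautifulSoup parsing runs in true parallel.
# Arguments and return values are pickled between processes, and the
# if __name__ == "__main__": guard (already present above) is mandatory on Windows.
# from multiprocessing import Pool

from multiprocessing import cpu_count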

Edit:

If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit. Pool.map converts the iterable you pass it into a list before sending it off to your workers, which obviously isn't going to work very well with a 10,000,000-entry list; holding that whole thing in memory is probably going to bog down your system. The same issue applies to storing all the results in a list. Instead, you should use Pool.imap:

imap(func, iterable[, chunksize])

A lazier version of map().

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // NUM_WORKERS * 4   # Try to get a good chunksize. You're probably going to have to tweak this, though. Try smaller and larger values and see how performance changes.
    pool = Pool(NUM_WORKERS)
    with open(fileName, "rb") as f, open("Output.csv", "ab") as out:
        # Keep the input file open while the results are consumed;
        # imap reads from it lazily as the workers need more lines.
        writeFile = csv.writer(out)
        for result in pool.imap(crawlToCSV, f, chunksize=chunksize):  # lazily iterate over results.
            writeFile.writerow(result)

With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f at a time, which should be much more manageable.
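If the order of the rows in Output.csv doesn't matter to you, pool.imap_unordered is a drop-in variant worth knowing about (not something the code above requires, just an option): it yields each result as soon as any worker finishes instead of in input order, so one slow page doesn't hold back rows that are already done. A sketch under that assumption:

# Same setup as above; only the iteration changes.
with open(fileName, "rb") as f, open("Output.csv", "ab") as out:
    writeFile = csv.writer(out)
    for result in pool.imap_unordered(crawlToCSV, f, chunksize=chunksize):
        writeFile.writerow(result)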