Read txt file with multi-threaded in python

python multithreading text-files

I agree with @aix, multiprocessing is definitely the way to go. Regardless you will be i/o bound -- you can only read so fast, no matter how many parallel processes you have running. But there can easily be some speedup.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).

import os.pathfrom multiprocessing import Poolimport sysimport timedef process_file(name):    ''' Process one file: count number of lines and words '''    linecount=0    wordcount=0    with open(name, 'r') as inp:        for line in inp:            linecount+=1            wordcount+=len(line.split(' '))    return name, linecount, wordcountdef process_files_parallel(arg, dirname, names):    ''' Process each file in parallel via Poll.map() '''    pool=Pool()    results=pool.map(process_file, [os.path.join(dirname, name) for name in names])def process_files(arg, dirname, names):    ''' Process each file in via map() '''    results=map(process_file, [os.path.join(dirname, name) for name in names])if __name__ == '__main__':    start=time.time()    os.path.walk('input/', process_files, None)    print "process_files()", time.time()-start    start=time.time()    os.path.walk('input/', process_files_parallel, None)    print "process_files_parallel()", time.time()-start

When I run this on my dual core machine there is a noticeable (but not 2x) speedup:

$ python process_files.pyprocess_files() 1.71218085289process_files_parallel() 1.28905105591

If the files are small enough to fit in memory, and you have lots of processing to be done that isn't i/o bound, then you should see even better improvement.

python multithreading text-files

Yes, it should be possible to do this in a parallel manner.

However, in Python it's hard to achieve parallelism with multiple threads. For this reason multiprocessing is the better default choice for doing things in parallel.

It is hard to say what kind of speedup you can expect to achieve. It depends on what fraction of the workload it will be possible to do in parallel (the more the better), and what fraction will have to be done serially (the less the better).

CodeHunter

Read txt file with multi-threaded in python

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last