
Fastest way to concatenate multiple files column wise - Python


Of the four methods I'd take the second, but you have to take care of small details in the implementation. With a few improvements it takes 0.002 seconds, while the original implementation takes about 6 seconds; the file I was working with had 1M rows. The gap should stay similar even if the file were 1,000 times bigger, because we use almost no memory.

Changes from the original implementation:

  • Use iterators where possible; otherwise memory consumption is penalized and you have to handle the whole file at once. (In Python 2, use itertools.izip instead of zip; in Python 3, zip is already lazy.)
  • When building each output line, use "{}\t{}".format(...) (or "%s\t%s" % (...) interpolation); concatenating with + creates a new string instance at every step.
  • There's no need to write line by line inside a for loop. You can pass an iterator to a single write.
  • Small buffers are interesting, but if we are using iterators the difference is very small; fetching all the data at once (for example with f1.readlines(1024*1000)) is much slower.
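To make the comparisons below reproducible, you can generate a pair of sample input files with a small helper like this; the filenames and row contents are just placeholders, not part of the original benchmark:

```python
def make_test_files(n_rows, f1="file1.txt", f2="file2.txt"):
    """Write two single-column files of n_rows lines each (placeholder data)."""
    with open(f1, "w") as a, open(f2, "w") as b:
        for i in range(n_rows):
            a.write("a{}\n".format(i))
            b.write("b{}\n".format(i))
```

Calling `make_test_files(1000000)` gives inputs of roughly the size used in the measurements below.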

Example:

    from itertools import izip  # Python 2; zip is already lazy in Python 3

    def concat_iter(file1, file2, output):
        with open(output, 'w', 1024) as fo, \
                open(file1, 'r') as f1, \
                open(file2, 'r') as f2:
            # strip the newline from the first column so '\t' joins the fields
            fo.write("".join("{}\t{}".format(l1.rstrip('\n'), l2)
                             for l1, l2 in izip(f1.readlines(1024),
                                                f2.readlines(1024))))

Profiling the original solution:

We see that the biggest problems are in write and zip (mainly because iterators are not used and the whole file has to be handled/processed in memory).

    ~/personal/python-algorithms/files$ python -m cProfile sol_original.py
    10000006 function calls in 5.208 seconds
    Ordered by: standard name
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
           1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    **9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}**
           3    0.000    0.000    0.000    0.000 {open}
           1    1.072    1.072    1.072    1.072 {zip}

Profiling the improved solution:

    ~/personal/python-algorithms/files$ python -m cProfile sol1.py
    3731 function calls in 0.002 seconds
    Ordered by: standard name
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
           1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
        1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
           1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
           2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    **     1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}**
           3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it's even faster, because zip is a lazy built-in iterator and we don't need to import anything from itertools.
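A Python 3 version of the same idea might look like the sketch below; the function name is illustrative, not taken from the original sol2.py:

```python
def concat_iter_py3(file1, file2, output):
    # zip is a lazy iterator in Python 3, so the two inputs are
    # streamed line by line rather than loaded into memory at once
    with open(output, 'w') as fo, open(file1) as f1, open(file2) as f2:
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in zip(f1, f2))
```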

    ~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py
    843 function calls (842 primitive calls) in 0.001 seconds
    [...]

It's also very nice to look at memory consumption and file-system accesses, which confirm what we said before:

    $ /usr/bin/time -v python sol1.py
    Command being timed: "python sol1.py"
    User time (seconds): 0.01
    [...]
    Maximum resident set size (kbytes): 7120
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 914
    [...]
    File system outputs: 40
    Socket messages sent: 0
    Socket messages received: 0

    $ /usr/bin/time -v python sol_original.py
    Command being timed: "python sol_original.py"
    User time (seconds): 5.64
    [...]
    Maximum resident set size (kbytes): 1752852
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 427697
    [...]
    File system inputs: 0
    File system outputs: 327696
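If you'd rather check peak memory from inside Python instead of via /usr/bin/time, the standard-library resource module (Unix only) gives a rough equivalent; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS:

```python
import resource

def peak_rss_kb():
    # peak resident set size of this process so far
    # (kilobytes on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```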


You can replace the for loop with writelines by passing a generator expression to it, and replace zip with izip from itertools in method 2. This may come close to paste, or even surpass it.

    from itertools import izip  # Python 2; use the built-in zip in Python 3

    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
        # strip the newline from the first column so the tab joins the two fields
        fout.writelines(b"{}\t{}".format(l1.rstrip(b'\n'), l2)
                        for l1, l2 in izip(fin1, fin2))

If you don't want to embed \t in the format string, you can use repeat from itertools:

    from itertools import izip, repeat  # Python 2

    fout.writelines(b"{}{}{}".format(l1.rstrip(b'\n'), sep, l2)
                    for l1, sep, l2 in izip(fin1, repeat(b'\t'), fin2))

If the files are of the same length, you can do away with izip:

    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
        fout.writelines(b"{}\t{}".format(line.rstrip(b'\n'), next(fin2))
                        for line in fin1)
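Conversely, if the files may differ in length, itertools.zip_longest (izip_longest in Python 2) pads the shorter one; the empty-string fill value below is just one possible choice:

```python
from itertools import zip_longest  # izip_longest in Python 2

def concat_unequal(file1, file2, output, fill=''):
    # zip_longest keeps yielding after the shorter file ends,
    # substituting `fill` for the missing column
    with open(output, 'w') as fo, open(file1) as f1, open(file2) as f2:
        fo.writelines("{}\t{}\n".format(l1.rstrip('\n'), l2.rstrip('\n'))
                      for l1, l2 in zip_longest(f1, f2, fillvalue=fill))
```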


You can try timing your function with timeit. Its documentation could be helpful.

Or use the equivalent %timeit magic function in a Jupyter notebook. You just need to write %timeit func(data) and you will get a report with the timing of your function. This article could help you with it.
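Outside a notebook, the standard timeit module does the same job; the snippet and repetition count below are arbitrary examples:

```python
import timeit

# total seconds spent running the statement `number` times
elapsed = timeit.timeit("'-'.join(str(n) for n in range(100))", number=10000)
print(elapsed)
```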