Efficient way to compare lines of two huge but sorted files
Assuming, as you state, that the files are sorted (I did not test it, but it should work):
FILE1_POS = 0
FILE2_POS = 1
MAX_POS = 2

# record numbers from File1 that have no counterpart in File2
missing = []

# the record number is an integer in field FILE*_POS of a whitespace-separated line
get_rec_no = lambda l, f=FILE1_POS: int(l.split(None, MAX_POS)[f])

with open('File1') as source, open('File2') as trgt:
    trgt = iter(trgt)
    trgt_rec_no = None
    try:
        for line in source:
            src_rec_no = get_rec_no(line)
            # advance File2 until its record number catches up with File1's
            while trgt_rec_no is None or trgt_rec_no < src_rec_no:
                trgt_rec_no = get_rec_no(next(trgt), FILE2_POS)
            if trgt_rec_no != src_rec_no:
                missing.append(src_rec_no)
    except StopIteration:
        # File2 is exhausted: the current record and everything still
        # left in File1 is missing from File2
        missing.append(src_rec_no)
        for line in source:
            missing.append(get_rec_no(line))
EDIT:
I changed it at your request.
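As a quick sanity check (reusing the missing list and the assumptions from the snippet above), you could simply report what was collected:

# How many File1 record numbers were not found in File2, and the first few of them.
print('%d record number(s) missing from File2' % len(missing))
print(missing[:10])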
If performance is the problem, you may want to look into something like Hadoop/MapReduce to split the files into a number of smaller files, and then run each comparison subprocess on a different processor to achieve a speedup.
As a simple example of splitting each file in two, you could have one subfile containing all the keys in [A-M] and another with the keys in [N-Z]. That way, if you know the key from one file, you know which subfile to search for its possible match. In theory, splitting the files in two should cut the search time in half (though not exactly in half, because there is overhead involved).
Basically, the programming would involve writing a mapper.py and a reducer.py. If you're not familiar with Hadoop/MapReduce, there are many good tutorials on setting up a MapReduce job in Python; the one I would suggest is the Udacity course "Intro to Hadoop and MapReduce", which solves some similar problems using Python and MapReduce. A rough sketch of what such a pair could look like is given below.
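To make that concrete, here is a rough sketch of a Hadoop Streaming mapper.py/reducer.py pair for this matching problem. It is only an illustration under my own assumptions (the 'F1'/'F2' tags, the test on the input file name, and taking the key to be the first whitespace-separated field are not from the original post):

#!/usr/bin/env python
# mapper.py -- emit "key <TAB> file-tag <TAB> original line" for every record.
import os
import sys

# Hadoop Streaming exposes the current input file through an environment
# variable (mapreduce_map_input_file in newer releases, map_input_file in
# older ones); adjust the test below to match your real file names.
input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))
tag = 'F2' if 'file2' in input_file else 'F1'

for line in sys.stdin:
    line = line.rstrip('\n')
    if line:
        key = line.split(None, 1)[0]
        print('%s\t%s\t%s' % (key, tag, line))

#!/usr/bin/env python
# reducer.py -- keys arrive grouped; print the File1 lines whose key never
# appeared in File2.
import sys

current_key = None
seen_in_file2 = False
file1_lines = []

def flush():
    # a key is "missing" if File1 had it but File2 never did
    if current_key is not None and not seen_in_file2:
        for rec in file1_lines:
            print(rec)

for line in sys.stdin:
    key, tag, original = line.rstrip('\n').split('\t', 2)
    if key != current_key:
        flush()
        current_key, seen_in_file2, file1_lines = key, False, []
    if tag == 'F2':
        seen_in_file2 = True
    else:
        file1_lines.append(original)

flush()

You would submit these with the hadoop-streaming jar, giving both files as input; locally you can approximate the pipeline by running the mapper over each file with the map_input_file environment variable set by hand, concatenating the two outputs, sorting them, and piping the result into the reducer.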
Also, you may want to consider editing your tags to include Hadoop MapReduce, in which case you could get some more specific help on writing the mapper.py and reducer.py.
The fact that the files are of different lengths doesn't rule out an awk solution. Taking your two inputs:
[ damien $] cat file1
cat: file1: No such file or directory
[ damien $] cat file1.txt
1 robert youh xpla@ioaio.fr
2 patrick yuad qqqq@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
8 tim qqjh hjahj@uayua.com
9 john ajkajk rtaeraer@auiaui.com
[ damien $] cat file2.txt
1 baby france paris
2 father usa detroit
3 mother uk london
4 baby italy milan
[ damien $]
consider the following script:
[ damien $] cat n_join.awk
#!/usr/bin/gawk -f

NR==FNR{
 # file2[$1]=$0;
 file2[$1]=1;
}

NR!=FNR{
 if(!($1 in file2)){
   # print current record if not in file2
   print;
 }else{
   # if $1 from file1 has been found.
   # if it's really unique in file1, it
   # can be deleted from file2. Otherwise
   # comment this line:
   delete file2[$1];
 }
}
[ damien $]
which gives the output:
[ damien $] chmod +x n_join.awk
[ damien $] ./n_join.awk file2.txt file1.txt
8 tim qqjh hjahj@uayua.com
9 john ajkajk rtaeraer@auiaui.com
[ damien $]
Note that file2.txt must be passed in first. I have no idea if this will work on a file that's 2 million lines long, but I would be interested to know if you have the time to try it. :)
If you can make the files available (unlikely), I'll try it myself... :D
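In case it helps to see the same hash-lookup idea outside awk, here is a rough Python rendering of what n_join.awk does (my own sketch, not part of the answer above; the print_unmatched name and the hard-coded file names are just for illustration): load file2's keys into a set, then stream file1 and print every line whose first field is not in that set. Like the awk version, it does not require the files to be sorted.

import sys

# Same idea as n_join.awk: remember the keys ($1) of file2, then print
# every line of file1 whose first field is not among those keys.
def print_unmatched(file1_path, file2_path):
    with open(file2_path) as f2:
        keys = set(line.split(None, 1)[0] for line in f2 if line.strip())
    with open(file1_path) as f1:
        for line in f1:
            if line.strip() and line.split(None, 1)[0] not in keys:
                sys.stdout.write(line)

print_unmatched('file1.txt', 'file2.txt')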
EDIT: I understand that you've accepted an answer and have probably moved on with your life; however, I would like to add some additional information.
If I create two large files of the type that you specify, file1.big.txt containing 5 million records:
[ damien $] seq 1 1 5000000 > file1.big.txt
[ damien $] sed -i 's|$| bob fsgfq ddd@ioaio.fr|' file1.big.txt
[ damien $] head file1.big.txt
1 bob fsgfq ddd@ioaio.fr
2 bob fsgfq ddd@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
4 bob fsgfq ddd@ioaio.fr
5 bob fsgfq ddd@ioaio.fr
6 bob fsgfq ddd@ioaio.fr
7 bob fsgfq ddd@ioaio.fr
8 bob fsgfq ddd@ioaio.fr
9 bob fsgfq ddd@ioaio.fr
10 bob fsgfq ddd@ioaio.fr
[ damien $] tail file1.big.txt
4999991 bob fsgfq ddd@ioaio.fr
4999992 bob fsgfq ddd@ioaio.fr
4999993 bob fsgfq ddd@ioaio.fr
4999994 bob fsgfq ddd@ioaio.fr
4999995 bob fsgfq ddd@ioaio.fr
4999996 bob fsgfq ddd@ioaio.fr
4999997 bob fsgfq ddd@ioaio.fr
4999998 bob fsgfq ddd@ioaio.fr
4999999 bob fsgfq ddd@ioaio.fr
5000000 bob fsgfq ddd@ioaio.fr
[ damien $]
and
[ damien $] seq 2 2 5000000 > file2.big.txt
[ damien $] sed -i 's|$| baby france paris|' file2.big.txt
[ damien $] head file2.big.txt
2 baby france paris
4 baby france paris
6 baby france paris
8 baby france paris
10 baby france paris
12 baby france paris
14 baby france paris
16 baby france paris
18 baby france paris
20 baby france paris
[ damien $] tail file2.big.txt
4999982 baby france paris
4999984 baby france paris
4999986 baby france paris
4999988 baby france paris
4999990 baby france paris
4999992 baby france paris
4999994 baby france paris
4999996 baby france paris
4999998 baby france paris
5000000 baby france paris
[ damien $]
with only even-numbered keys. Running my script gives:
[ damien $] time ./n_join.awk file2.big.txt file1.big.txt > output.big

real    0m4.154s
user    0m3.893s
sys     0m0.207s
[ damien $] head output.big
1 bob fsgfq ddd@ioaio.fr
3 bob fsgfq ddd@ioaio.fr
5 bob fsgfq ddd@ioaio.fr
7 bob fsgfq ddd@ioaio.fr
9 bob fsgfq ddd@ioaio.fr
11 bob fsgfq ddd@ioaio.fr
13 bob fsgfq ddd@ioaio.fr
15 bob fsgfq ddd@ioaio.fr
17 bob fsgfq ddd@ioaio.fr
19 bob fsgfq ddd@ioaio.fr
[ damien $] tail output.big
4999981 bob fsgfq ddd@ioaio.fr
4999983 bob fsgfq ddd@ioaio.fr
4999985 bob fsgfq ddd@ioaio.fr
4999987 bob fsgfq ddd@ioaio.fr
4999989 bob fsgfq ddd@ioaio.fr
4999991 bob fsgfq ddd@ioaio.fr
4999993 bob fsgfq ddd@ioaio.fr
4999995 bob fsgfq ddd@ioaio.fr
4999997 bob fsgfq ddd@ioaio.fr
4999999 bob fsgfq ddd@ioaio.fr
[ damien $] wc -l output.big
2500000 output.big
[ damien $]
The whole thing completes in about 4 seconds, which doesn't seem at all prohibitive. Either there's a big difference in the data sets, or your script behaves very differently from mine. Maybe this is useful to somebody. :/
PS: I have an i7-3770 CPU @ 3.40GHz, according to /proc/cpuinfo.