Subtract corresponding lines
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'-2 1009-3 10107 10121 10138 1014
join
will join sorted files.-j2
will join one second column.-a1
will print records from file1 even it there is no corresponding row in file2.-a2
Same as -a1
but applied for file2.-oauto
is in this case the same as -o1.2,1.1,2.1
which will print the joined column, and then the remaining columns from file1 and file2.-e0
will insert 0
instead of an empty column. This works with -a1
and -a2
.
The output from join
is three columns like:
1009 3 51010 0 31012 7 01013 2 11014 8 0
Which is piped to awk, to subtract column three from column 2, and then reformatting.
$ awk 'NR==FNR { a[$2]=$1; next } { a[$2]-=$1 } END { for(i in a) print a[i],i }' file1 file27 10121 10138 1014-2 1009-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2
the files first, then sort -m
(merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...3 1009 15 1009 2 # previous $2 = current $2 -> subtract3 1010 2 # previous $2 =/= current and current $3=2 print -$37 1012 12 1013 1 # previous $2 =/= current and current $3=1 print prev $21 1013 28 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed MortonHope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk{ split(prev,p) }$2 == p[2] { print p[1] - $1, p[2] prev = "" next}p[2] != "" { print (p[3] == 1 ? p[1] : 0-p[1]), p[2]}{ prev = $0 }END { split(prev,p) print (p[3] == 1 ? p[1] : 0-p[1]), p[2]}$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk-2 1009-3 10107 10121 10138 1014
Since the files are sorted¹, you can merge them line-by-line with the join
utility in coreutils
:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b1009 3 51010 0 31012 7 01013 2 11014 8 0
All those options are required:
-j2
says to join based on the second column of each file-o auto
says to make every row have the same format, beginning with the join key-e 0
says that missing values should be substituted with zero-a 1
and-a 2
include rows that are absent from one file or another- the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc
program:
sed -re 's/.*/c&-n[ ]np/e'
This takes the three values on each line and rearranges them into a dc
command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc
, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \> | sed -e 's/.*/c& -n[ ]np/' \> | dc-2 1009-3 10107 10121 10138 1014
¹ The sorting needs to be consistent with LC_COLLATE
locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.