Subtract corresponding lines Subtract corresponding lines shell shell

Subtract corresponding lines


Assuming that both files are sorted on second column:

$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'-2 1009-3 10107 10121 10138 1014

join will join sorted files.
-j2 will join one second column.
-a1 will print records from file1 even it there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o1.2,1.1,2.1 which will print the joined column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.

The output from join is three columns like:

1009 3 51010 0 31012 7 01013 2 11014 8 0

Which is piped to awk, to subtract column three from column 2, and then reformatting.


$ awk 'NR==FNR { a[$2]=$1; next }               { a[$2]-=$1 }           END { for(i in a) print a[i],i }' file1 file27 10121 10138 1014-2 1009-3 1010

It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:

$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...3 1009 15 1009 2  # previous $2 = current $2 -> subtract3 1010 2  # previous $2 =/= current and current $3=2 print -$37 1012 12 1013 1  # previous $2 =/= current and current $3=1 print prev $21 1013 28 1014 1

(I'm out of time for now, maybe I'll finish it later)

EDIT by Ed MortonHope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:

$ cat tst.awk{ split(prev,p) }$2 == p[2] {    print p[1] - $1, p[2]    prev = ""    next}p[2] != "" {    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]}{ prev = $0 }END {    split(prev,p)    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]}$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk-2 1009-3 10107 10121 10138 1014


Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:

$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b1009 3 51010 0 31012 7 01013 2 11014 8 0

All those options are required:

  • -j2 says to join based on the second column of each file
  • -o auto says to make every row have the same format, beginning with the join key
  • -e 0 says that missing values should be substituted with zero
  • -a 1 and -a 2 include rows that are absent from one file or another
  • the filenames (I've used names based on the question number here)

Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:

sed -re 's/.*/c&-n[ ]np/e'

This takes the three values on each line and rearranges them into a dc command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)

c 1009 3 5 -n [ ]n p

which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving

-2 1009

as required.

We can then pipe all these lines into dc, giving us the output file that we want:

$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \>   | sed -e 's/.*/c& -n[ ]np/' \>   | dc-2 1009-3 10107 10121 10138 1014

¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.


TL;DR

The full command is:

join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc

It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.