extracting unique values between 2 sets/files


$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; this is the default awk action when no explicit action is given (see the short example below).
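
For instance (a quick illustration with made-up file contents), if file1 contains a, b, c and file2 contains b, c, d, only d is unique to file2:

$ printf 'a\nb\nc\n' > file1
$ printf 'b\nc\nd\n' > file2
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
d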


Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

Note that comm compares the sorted files line by line without collapsing duplicates, so if a line occurs more often in file2 than in file1, the extra occurrences are still printed as unique to file2. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
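
To make the duplicate behaviour concrete, here is a small illustration with throwaway test files (file names and contents invented for the example): a line that appears twice in file2 but only once in file1 still shows up in the output unless uniq is applied:

$ printf 'x\n' | sort > file1.sorted
$ printf 'x\nx\ny\n' | sort > file2.sorted
$ comm -1 -3 file1.sorted file2.sorted
x
y
$ printf 'x\nx\ny\n' | sort | uniq > file2.sorted
$ comm -1 -3 file1.sorted file2.sorted
y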

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.


I was wondering which of the following solutions was the "fastest" for "larger" files:

awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2   # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2            # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2

Results of my benchmarks in short:

  • Do not use grep -Fxf, it's much slower (2-4 times in my tests).
  • comm is slightly faster than join.
  • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input.)
  • awk1 + awk2 apparently use more RAM and less CPU. Real (wall-clock) times are lower for comm, probably because the two sort processes in the process substitutions run in parallel; total CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
  321599   321599  8098710 file1
  321603   321603  8098794 file2
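
A rough harness for repeating the timings might look like the following (a sketch only; the repetition loop, the use of bash's time keyword, and redirecting output to /dev/null are my assumptions, not the original test script):

$ export LC_ALL=C
$ for i in 1 2 3; do time awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 > /dev/null; done
$ for i in 1 2 3; do time comm -13 <(sort file1) <(sort file2) > /dev/null; done
$ for i in 1 2 3; do time join -v 2 <(sort file1) <(sort file2) > /dev/null; done
$ for i in 1 2 3; do time grep -v -F -x -f file1 file2 > /dev/null; done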

Typical results of fastest runs

awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004

BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2