extracting unique values between 2 sets/files


$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; this is the default awk action when no explicit action is given (see the short example below).
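
For instance (a quick illustration with made-up file contents), if file1 contains a, b, c and file2 contains b, c, d, only d is unique to file2:

$ printf 'a\nb\nc\n' > file1
$ printf 'b\nc\nd\n' > file2
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
d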


Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

Note that comm compares the sorted files line by line without collapsing duplicates, so if a line occurs more often in file2 than in file1, the extra occurrences are still printed as unique to file2. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
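
To make the duplicate behaviour concrete, here is a small illustration with throwaway test files (file names and contents invented for the example): a line that appears twice in file2 but only once in file1 still shows up in the output unless uniq is applied:

$ printf 'x\n' | sort > file1.sorted
$ printf 'x\nx\ny\n' | sort > file2.sorted
$ comm -1 -3 file1.sorted file2.sorted
x
y
$ printf 'x\nx\ny\n' | sort | uniq > file2.sorted
$ comm -1 -3 file1.sorted file2.sorted
y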

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.


I was wondering which of the following solutions was the "fastest" for "larger" files:

awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2   # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2            # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2

Results of my benchmarks in short:

  • Do not use grep -Fxf, it's much slower (2-4 times in my tests).
  • comm is slightly faster than join.
  • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input.)
  • awk1 + awk2 apparently use more RAM and less CPU. Real (wall-clock) times are lower for comm, probably because the two sort processes in the process substitutions run in parallel; total CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
  321599   321599  8098710 file1
  321603   321603  8098794 file2
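
A rough harness for repeating the timings might look like the following (a sketch only; the repetition loop, the use of bash's time keyword, and redirecting output to /dev/null are my assumptions, not the original test script):

$ export LC_ALL=C
$ for i in 1 2 3; do time awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 > /dev/null; done
$ for i in 1 2 3; do time comm -13 <(sort file1) <(sort file2) > /dev/null; done
$ for i in 1 2 3; do time join -v 2 <(sort file1) <(sort file2) > /dev/null; done
$ for i in 1 2 3; do time grep -v -F -x -f file1 file2 > /dev/null; done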

Typical results of fastest runs

awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004

BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2