Left outer join on two files in unix Left outer join on two files in unix unix unix

Left outer join on two files in unix


It's join -t '|' file1 file2 -a1

Options used:

t: Delimiter.
a: Decides the file number from which the unpaired lines have to be printed.

join -t '|' file1 file2 -a2 would do a right outer join.

Sample Run

   [aman@aman test]$ cat f1      01|a|jack|d    02|b|ron|c    03|d|tom|e    [aman@aman test]$ cat f2    01|a|nemesis|f    02|b|brave|d    04|d|gorr|h    [aman@aman test]$ join -t '|'  f1 f2 -a1    01|a|jack|d|a|nemesis|f    02|b|ron|c|b|brave|d    03|d|tom|e


To do exactly what the question asks is a bit more complicated than previous answer and would require something like this:

sed 's/|/:/2' file1 | sort -t: >file1.tmpsed 's/|/:/2' file2 | sort -t: >file2.tmpjoin -t':' file1.tmp file2.tmp -a1 -e'|' -o'0,1.2,2.2' | tr ':' '|'

Unix join can only join on a single field AFAIK so you must use files that use a different delimiter to "join two files on two fields", in this case the first two fields. I'll use a colon :, however if : exists in any of the input you would need to use something else, a tab character for example might be a better choice for production use. I also re-sort the output on the new compound field, sort -t:, which for the example input files makes no difference but would for real world data. sed 's/|/:/2' replaces the second occurrence of pipe with colon on each line in file.

file1.tmp

01|a:jack|d02|b:ron|c03|d:tom|e

file2.tmp

01|a:nemesis|f02|b:brave|d04|d:gorr|h

Now we use join output filtered by tr with a few more advanced options:

  • -t':' specify the interim colon delimiter
  • -a1 left outer join
  • -e'|' specifies the replacement string for failed joins, basically the final output delimiter N-1 times where N is the number of pipe delimited fields joined to the right of the colon in file2.tmp. In this case N=2 so one pipe character.
  • -o'0,1.2,2.2' specifies the output format:
    • 0 join field
    • 1.2 field 2 of file1.tmp, i.e. everything right of colon
    • 2.2 field 2 of file2.tmp
  • tr ':' '|' Finally we translate the colons back to pipes for the final output.

The output now matches the question sample output exactly which the previous answer did not do:

01|a|jack|d|nemesis|f02|b|ron|c|brave|d03|d|tom|e||


I recently had this issue with a very simple input file , just one field, hence no considerations of delimiters.

cat file1 > k1cat file2 >> k1sort k1 | uniq -c | grep "^.*1 "  will give you lines that occur in only 1 file

This is a special case, it may not be applicable or comparable to the above techniques posted here, but putting out there in case its useful to someone, who's looking for left outer joins (i.e. unmatched cases only). Grepping for "^.*2 " will give you matched cases. In case you have a multi-field file (the more common case), but you only care about a single join field, you can use Awk to create a key-only file (for each file) and then process as above.