Left outer join on two files in unix
It's join -t '|' file1 file2 -a1
Options used:
t: Delimiter.
a: Decides the file number from which the unpaired lines have to be printed.
join -t '|' file1 file2 -a2
would do a right outer join.
Sample Run
[aman@aman test]$ cat f1 01|a|jack|d 02|b|ron|c 03|d|tom|e [aman@aman test]$ cat f2 01|a|nemesis|f 02|b|brave|d 04|d|gorr|h [aman@aman test]$ join -t '|' f1 f2 -a1 01|a|jack|d|a|nemesis|f 02|b|ron|c|b|brave|d 03|d|tom|e
To do exactly what the question asks is a bit more complicated than previous answer and would require something like this:
sed 's/|/:/2' file1 | sort -t: >file1.tmpsed 's/|/:/2' file2 | sort -t: >file2.tmpjoin -t':' file1.tmp file2.tmp -a1 -e'|' -o'0,1.2,2.2' | tr ':' '|'
Unix join can only join on a single field AFAIK so you must use files that use a different delimiter to "join two files on two fields", in this case the first two fields. I'll use a colon :, however if : exists in any of the input you would need to use something else, a tab character for example might be a better choice for production use. I also re-sort the output on the new compound field, sort -t:
, which for the example input files makes no difference but would for real world data. sed 's/|/:/2'
replaces the second occurrence of pipe with colon on each line in file.
file1.tmp
01|a:jack|d02|b:ron|c03|d:tom|e
file2.tmp
01|a:nemesis|f02|b:brave|d04|d:gorr|h
Now we use join
output filtered by tr
with a few more advanced options:
-t':'
specify the interim colon delimiter-a1
left outer join-e'|'
specifies the replacement string for failed joins, basically the final output delimiter N-1 times where N is the number of pipe delimited fields joined to the right of the colon in file2.tmp. In this case N=2 so one pipe character.-o'0,1.2,2.2'
specifies the output format:0
join field1.2
field 2 of file1.tmp, i.e. everything right of colon2.2
field 2 of file2.tmp
tr ':' '|'
Finally we translate the colons back to pipes for the final output.
The output now matches the question sample output exactly which the previous answer did not do:
01|a|jack|d|nemesis|f02|b|ron|c|brave|d03|d|tom|e||
I recently had this issue with a very simple input file , just one field, hence no considerations of delimiters.
cat file1 > k1cat file2 >> k1sort k1 | uniq -c | grep "^.*1 " will give you lines that occur in only 1 file
This is a special case, it may not be applicable or comparable to the above techniques posted here, but putting out there in case its useful to someone, who's looking for left outer joins (i.e. unmatched cases only). Grepping for "^.*2 " will give you matched cases. In case you have a multi-field file (the more common case), but you only care about a single join field, you can use Awk to create a key-only file (for each file) and then process as above.