properly join two files based on 2 columns in common properly join two files based on 2 columns in common unix unix

properly join two files based on 2 columns in common


awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file12L      5753   331582L      8813   331582L      7885   331592L      1279   331592L      5095   33158$$ cat file22L      8813    0.6    1.22L      5762    0.4    0.52L      1279    0.5    0.9$$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file22L      8813    0.6    1.2 331582L      1279    0.5    0.9 33159$

If that's not what you want, please clarify and perhaps post some more representative sample input/output.

Commented version of the above code to provide requested explanation:

awk ' # START SCRIPT# IF the number of records read so far across all files is equal#    to the number of records read so far in the current file, a#    condition which can only be true for the first file read, THEN NR==FNR {   # populate array "a" such that the value indexed by the first   # 2 fields from this record in file1 is the value of the third   # field from the first file.   a[$1,$2]=$3   # Move on to the next record so we don't do any processing intended   # for records from the second file. This is like an "else" for the   # NR==FNR condition.   next} # END THEN# We only reach this part of the code if the above condition is false,# i.e. if the current record is from file2, not from file1.# IF the array index constructed from the first 2 fields of the current#    record exist in array a, as would occur if these same values existed#    in file1, THEN($1,$2) in a {   # print the current record from file2 followed by the value from file1   # that occurred at field 3 of the record that had the same values for   # field 1 and field 2 in file1 as the current record from file2.   print $0, a[$1,$2]} # END THEN' file1 file2 # END SCRIPT

Hope that helps.


If you like to join the files line by line then use this command:

join -o 1.2,1.3,2.4,2.5,1.4 <(cat -n file1) <(cat -n file2)

As you updated the question:

join -o 1.1,2.2,2.3,1.2 <(sed 's/[[:space:]]\+/@/' file1|sort) \    <(sed 's/[[:space:]]\+/@/' file2|sort)|sed 's/@/\t/'

First replace the first delimiter in each row with some non-space character and sort both input files. Then use join to make the actual join. Filter out its output to replace the non-space char with space.

This is the output from the files as in question:

xyz]$ join -o 1.1,2.2,2.3,1.2 <(sed 's/[[:space:]]\+/@/' file1|sort) \<(sed 's/[[:space:]]\+/@/' file2|sort)|sed 's/@/\t/'2L  25753 0.967741935483871 0.869565217391304 331582L  28813 0.181818181818182 0.692307692307692 331582L  31003 0.36 0.666666666666667 331582L  31077 0.611111111111111 0.931034482758621 331612L  31279 0.75 1 331613L  32124 0.558823529411765 0.857142857142857 453393L  33256 0.769230769230769 0.90625 45339


You can use the join command but you need to create a single join field in each data table. Assuming that you do have values other that 2L in column 1, then this code should work regardless of the sorted or unsorted nature of the two input files:

tmp=${TMPDIR:-/tmp}/tmp.$$trap "rm -f $tmp.?; exit 1" 0 1 2 3 13 15awk '{print $1 ":" $2, $0}' file1 | sort > $tmp.1awk '{print $1 ":" $2, $0}' file2 | sort > $tmp.2join -o 2.2,2.3,2.4,2.5,1.4 $tmp.1 $tmp.2rm -f $tmp.?trap 0

If you have bash and 'process substitution', or if you know that the data is already sorted appropriately, you can simplify the processing.


I'm not entirely sure why your code wasn't working, but I'd probably be using a[$1,$2] for the subscripts; it will give you less trouble if some of your column 1 values are pure numeric and can therefore be confused when you concatenate columns 1 and 2. That's why the 'key creation' awk scripts used a colon between the fields.


With revised data files as shown:

file1

2L      5753   331582L      8813   331582L      7885   331582L      7885   331592L      1279   331582L      5095   331582L      3256   331582L      5372   331582L      7088   331612L      5762   33161

file2

2L      5095    0.666666666666667       12L      5372    0.5     0.9259259259259262L      5762    0.434782608695652       0.5806451612903232L      5904    0.571428571428571       0.8695652173913042L      5974    0.434782608695652       0.6944444444444442L      6353    0.785714285714286       0.842L      7088    0.590909090909091       0.7333333333333332L      7885    0.714285714285714       0.8648648648648652L      7902    0.642857142857143       0.8108108108108112L      8263    0.833333333333333       0.787878787878788

(Unchanged from the question.)

Output

2L 5095 0.666666666666667 1 331582L 5372 0.5 0.925925925925926 331582L 5762 0.434782608695652 0.580645161290323 331612L 7088 0.590909090909091 0.733333333333333 331612L 7885 0.714285714285714 0.864864864864865 331582L 7885 0.714285714285714 0.864864864864865 33159