How to extract one column from multiple files, and paste those columns into one file?


Here's one way using awk and a sorted glob of files:

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)

Results:

    1 8 a
    2 9 b
    3 10 c
    4 11 d
    5 12 e
    6 13 f
    7 14 g

Explanation:

  • For each line of input of each input file:

    • Use the file's line number (FNR) as an array key, with column 5 as the value.

    • (a[FNR] ? a[FNR] FS : "") is a ternary expression that builds up the array's value as a record. It tests whether the array already holds a non-empty value for the file's line number. If so, it takes that value followed by the default field separator (FS, a single space) before appending the fifth column. Otherwise nothing is prepended, and the entry is set to the fifth column alone.

  • At the end of the script:

    • Use a C-style loop to iterate through the array, printing each of the array's values. The loop bound is FNR, which at that point holds the line count of the last file read.
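The accumulation described in the explanation can also be sketched in Python, which may make the per-line bookkeeping easier to follow. This is only an illustration, not part of the original answer: the function name paste_column and the in-memory sample lines are hypothetical, and unlike the awk one-liner it prints every line number it has seen rather than stopping at the last file's line count.

```python
def paste_column(files_lines, column=5, sep=" "):
    """Join one 1-based column from several files, line by line.

    files_lines is an iterable of line sequences, one per input file
    (hypothetical in-memory stand-ins for the real files).
    """
    rows = {}  # line number -> accumulated record, like the awk array a[FNR]
    for lines in files_lines:
        for line_number, line in enumerate(lines, start=1):
            fields = line.split()
            # A line with too few fields contributes an empty value, as in awk.
            value = fields[column - 1] if len(fields) >= column else ""
            if line_number in rows:
                rows[line_number] += sep + value  # append after the separator
            else:
                rows[line_number] = value  # first file: nothing to prepend
    return [rows[n] for n in sorted(rows)]


# Hypothetical sample data: only the fifth field matters here.
file_a = ["x x x x 1", "x x x x 2"]
file_b = ["x x x x 8", "x x x x 9"]
print("\n".join(paste_column([file_a, file_b])))
```

With real files, each element of files_lines could be an open file object, since iterating over one yields its lines.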


For only ~4000 files, you should be able to do:

 find . -name 'sample_problem*_part*.txt' | xargs paste

If find is giving names in the wrong order, pipe it to sort:

 find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
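One common reason for the "wrong order" is that a plain lexicographic sort places part10 before part2. The following Python sketch shows the difference between lexicographic and natural (numeric-aware) ordering. The file names are hypothetical examples that merely match the pattern above, and natural_key is an illustrative helper, not something taken from the answer.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'part10' sorts after 'part2'."""
    # re.split with a capturing group keeps the digit runs; they always land
    # at odd positions, so strings are only compared with strings.
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical file names matching sample_problem*_part*.txt
names = [
    "sample_problem1_part10.txt",
    "sample_problem1_part2.txt",
    "sample_problem1_part1.txt",
]

print(sorted(names))                   # lexicographic: part1, part10, part2
print(sorted(names, key=natural_key))  # natural: part1, part2, part10
```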


    # print filenames in sorted order
    find -name sample\*.txt | sort |
    # extract 5-th column from each file and print it on a single line
    xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
    # transpose
    python transpose.py ?

where transpose.py is:

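    #!/usr/bin/env python3
    """Write lines from stdin as columns to stdout."""
    import sys
    from itertools import zip_longest  # named izip_longest on Python 2

    # Optional first argument: the placeholder used for missing cells.
    missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'

    # Each input line is one column; zip_longest pads the shorter ones.
    for row in zip_longest(*[column.split() for column in sys.stdin],
                           fillvalue=missing_value):
        print(" ".join(row))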

Output

    1 8 a
    2 9 b
    3 10 c
    4 11 d
    5 ? e
    6 ? f
    ? ? g

This assumes the first and second files have fewer lines than the third one (missing values are replaced by '?').
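To see the padding behaviour in isolation, here is a small self-contained Python 3 sketch of the transpose step. The input lines are an assumption, reconstructed from the output shown above (column values of 6, 4 and 7 lines respectively), and the transpose function is a hypothetical wrapper around the same zip_longest idea.

```python
from itertools import zip_longest


def transpose(lines, missing_value="?"):
    """Turn each input line into a column, padding short columns."""
    columns = [line.split() for line in lines]
    return [" ".join(row)
            for row in zip_longest(*columns, fillvalue=missing_value)]


# Assumed fifth-column values of three files, one file per line.
lines = ["1 2 3 4 5 6", "8 9 10 11", "a b c d e f g"]
for row in transpose(lines):
    print(row)
```

Rows beyond the end of a shorter column receive the placeholder, which is why the last row starts with two '?' entries.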