An efficient way to transpose a file in Bash An efficient way to transpose a file in Bash unix unix

An efficient way to transpose a file in Bash


awk '{     for (i=1; i<=NF; i++)  {        a[NR,i] = $i    }}NF>p { p = NF }END {        for(j=1; j<=p; j++) {        str=a[1,j]        for(i=2; i<=NR; i++){            str=str" "a[i,j];        }        print str    }}' file

output

$ more file0 1 23 4 56 7 89 10 11$ ./shell.sh0 3 6 91 4 7 102 5 8 11

Performance against Perl solution by Jonathan on a 10000 lines file

$ head -5 file1 0 1 22 3 4 53 6 7 84 9 10 111 0 1 2$  wc -l < file10000$ time perl test.pl file >/dev/nullreal    0m0.480suser    0m0.442ssys     0m0.026s$ time awk -f test.awk file >/dev/nullreal    0m0.382suser    0m0.367ssys     0m0.011s$ time perl test.pl file >/dev/nullreal    0m0.481suser    0m0.431ssys     0m0.022s$ time awk -f test.awk file >/dev/nullreal    0m0.390suser    0m0.370ssys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.

$ cat tst.awkBEGIN { FS=OFS="\t" }{    for (rowNr=1;rowNr<=NF;rowNr++) {        cell[rowNr,NR] = $rowNr    }    maxRows = (NF > maxRows ? NF : maxRows)    maxCols = NR}END {    for (rowNr=1;rowNr<=maxRows;rowNr++) {        for (colNr=1;colNr<=maxCols;colNr++) {            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)        }    }}$ awk -f tst.awk fileX       row1    row2    row3    row4column1 0       3       6       9column2 1       4       7       10column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awkBEGIN { FS=OFS="\t" }{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }ENDFILE {    print ""    if (ARGIND < NF) {        ARGV[ARGC] = FILENAME        ARGC++    }}$ awk -f tst.awk fileX       row1    row2    row3    row4column1 0       3       6       9column2 1       4       7       10column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line and it uses GNU awk for ENDFILE and ARGIND but any awk can do the same with tests on FNR==1 and END.


rs

rs is a BSD utility which also comes with macOS, but it is available from package managers on other platforms. It is named after the reshape function in APL.

Use sequences of spaces and tabs as column separator:

rs -T

Use tab as column separator:

rs -c -C -T

Use comma as column separator:

rs -c, -C, -T

-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.

Do not use -t instead of -T, because it automatically selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).

When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:

$ seq 4|paste -d, - -|rs -c, -C, -T1,3,2,4,$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//1,32,4

This fails with tables where the first line ends with one or more empty columns, because the number of columns is determined based on the number of columns on the first row:

$ rs -c, -C, -T<<<$'1,\n3,4'1,3,4,

gawk

$ seq 4|paste -d, - -|awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?"\n":FS),a[i][j]}' FS=,1,32,4

This uses arrays of arrays which is a gawk extension. macOS comes with a version of nawk from 2007 which does not support arrays of arrays.

To use space as a separator without collapsing sequences of space and tab characters, use FS='[ ]'.

Ruby

$ seq 4|paste -d, - -|ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'1,32,4

The -1 argument to split disables discarding empty fields at the end:

$ ruby -e'p"a,,".split(",")'["a"]$ ruby -e'p"a,,".split(",",-1)'["a", "", ""]

Function form:

$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}$ seq 4|paste -d, - -|tp ,1,32,4

s==" "?/ /:s is used above because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:

$ ruby -e'p" a  \tb ".split(/ /,-1)'["", "a", "", "\tb", ""]$ ruby -e'p" a  \tb ".split(" ",-1)'["a", "b", ""]

jq

tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.

R

$ printf %s\\n 1,2 3,4|Rscript -e 'write.table(t(read.table("stdin",sep=",")),"",sep=",",quote=F,col.names=F,row.names=F)'1,32,4

If you replace Rscript with R, it echoes the code that is being run to STDOUT. It also results in the error ignoring SIGPIPE signal if it is followed by a command like head -n1 which exits before it has read the whole STDIN.

write.table prints to STDOUT when the argument for the output file is an empty string.


A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sysfor c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).