
AWK Split File every n-th Row but group IDs together


Using any awk in any shell on every Unix box:

$ cat tst.awk
/^@/ {
    hdr = hdr $0 ORS
    next
}
( (++numLines) % 5 ) == 1 {
    if ( $0 == prev ) {
        --numLines
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{
    print > out
    prev = $0
}

$ awk -f tst.awk text.txt

$ head text.txt.*
==> text.txt.1 <==
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

==> text.txt.2 <==
@something
@somethingelse
@anotherthing
4
4
4
5
5

==> text.txt.3 <==
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

==> text.txt.4 <==
@something
@somethingelse
@anotherthing
10
11
11
11
14

==> text.txt.5 <==
@something
@somethingelse
@anotherthing
15
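To try this end to end, here is a minimal sketch that recreates an input consistent with the output above and passes the block size in as a variable. The sample data is reconstructed from that output rather than taken from the asker's file, and the variable name `n` is my own addition (the original hard-codes 5). It assumes `n` is at least 2, since `x % 1` is never 1.

```shell
# Work in a scratch directory so the generated files are easy to clean up.
cd "$(mktemp -d)" || exit 1

# Sample input reconstructed from the output above (an assumption).
printf '%s\n' @something @somethingelse @anotherthing \
    1 2 2 3 3 3 4 4 4 5 5 6 7 7 8 9 9 9 10 11 11 11 14 15 > text.txt

# Same logic as tst.awk, with the block size passed in as n (hypothetical name).
awk -v n=5 '
/^@/ { hdr = hdr $0 ORS; next }           # collect header lines
( (++numLines) % n ) == 1 {               # first line of a would-be new block
    if ( $0 == prev ) {
        --numLines                        # same ID as before: stay in the current file
    }
    else {
        close(out)                        # finish the previous block
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out            # every block starts with the header
        numLines = 1
    }
}
{ print > out; prev = $0 }
' text.txt

# Show how many lines ended up in each output file.
wc -l text.txt.*
```

Changing `-v n=5` to another value of 2 or more changes the minimum number of data lines per file; runs of identical IDs still stay together.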


Nice question.
With your example, this would work:

awk 'BEGIN{i=1;}/\@/{header= header == ""? $0 : header "\n" $0; next}c>=5 && $1!=prev{i++;c=0;}{if(!c) print header>FILENAME"."i; print > FILENAME"."i;c++;prev=$1;}' test.txt

You need to strip the header out and keep your own counter (c above). NR is just the current line number of the input, so it will not meet your needs when the groups do not end on multiples of 5.
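To make the NR point concrete, here is a small sketch comparing a naive NR-based split with the counter-based one. The file names `ids.txt`, `naive.N` and `grouped.N` are made up for illustration, and the header handling is left out to keep it short.

```shell
# Scratch directory and a tiny header-less sample (both assumptions).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 1 2 2 3 3 3 4 4 4 5 5 > ids.txt

# Naive approach: start a new file every 5 input lines, ignoring IDs.
# The run of 3s ends up split between naive.1 and naive.2.
awk '(NR % 5) == 1 { close(out); out = "naive." (++n) } { print > out }' ids.txt

# Counter approach: roll over only once 5 lines are written AND the ID changes.
awk 'c >= 5 && $1 != prev { close(out); i++; c = 0 }
     !c { out = "grouped." (i + 1) }
     { print > out; c++; prev = $1 }' ids.txt

head naive.* grouped.*
```

With the counter, the first file is allowed to grow past 5 lines until the ID changes, which is what keeps equal IDs together.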

Break it up and improve a tiny bit:

awk 'BEGIN{i=1;}
  /\@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0;}
  !c {print header>FILENAME"."i;}
  {print > FILENAME"."i;c++;prev=$1;}
  ' test.txt

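To solve the potential problems mentioned in the comment, this version closes each output file before moving on to the next one and builds the output file name in a variable instead of using an unparenthesized expression after `>`: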

awk 'BEGIN{i=1}
  /\@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0}
  !c {close(f);f=(FILENAME"."i);print header>f}
  {print>f;c++;prev=$1}
  ' test.txt

Or check Ed's answer, which is more precise and portable across platforms and awk versions.


With your shown samples, please try the following awk program. Written and tested in GNU awk.

awk '
BEGIN{
  outFile="test.txt"
  count=1
}
/@/{
  header=(header?header ORS:"")$0
  next
}
{
  arr[$0]=(arr[$0]?arr[$0] ORS:"")$0
}
END{
  PROCINFO["sorted_in"] = "@ind_num_asc"
  print header > (outFile count)
  for(i in arr){
    num=split(arr[i],arr2,"\n")
    print arr[i] > (outFile count)
    len+=num
    if(len>=5){ len=0 }
    if(len==0){
      close(outFile count)
      count++
      print header > (outFile count)
    }
  }
}
'  Input_file