How can I merge multiple lines to create exactly two records based on field separators?


The 'sed' approach:

sed ':a;N;$!ba;s/\n|/|/g' input.txt

That said, awk would be faster & easier to understand/maintain. I just had that example handy (it adapts a common sed idiom for replacing newlines across a whole file).
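To see what that command does, here is a tiny, made-up example (the sample lines are invented, but follow the same layout as the real data, where continuation lines begin with "|"; input is piped in rather than read from input.txt):

$ printf 'A\n|B\nC\n|D\n' | sed ':a;N;$!ba;s/\n|/|/g'
A|B
C|D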

EDIT:

To clarify the difference between this answer (option #1) and the alternative solution by @potong, sed ':a;N;s/\n|/|/;ta;P;D' file (which I actually prefer, and which I'll call option #2):

  • note that these are two of many possible options with sed (a third, GNU-only shortcut is sketched just after this list). I actually prefer non-sed solutions since they generally run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 all in-memory, and option #2 as a stream. (note: below when I say "buffer", technically I mean "pattern space"):
  • option #1 reads the whole file into memory:
    • :a is just a label; N says append the next line to the buffer; if the last line ($) has not (!) been reached, branch (b) back to label :a ...
    • then after the whole file is read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (newline followed by "|") with just a "|", on the entire (g) buffer
  • option #2 processes just a couple of lines at a time:
    • reads / appends the next line (N) into the buffer, processes it (s/\n|/|/); branches (t) back to label :a only if the substitution was successful; otherwise prints (P) and clears/deletes (D) the current buffer up to the first embedded newline ... and the stream continues.
  • option #1 takes a lot more memory to run: in general, roughly as much as the size of your file. Option #2 requires minimal memory; so little I didn't bother to see what it correlates to (I'm guessing the length of a line).
  • option #1 runs faster: in general, about twice as fast as option #2, though obviously it depends on the file and what is being done.
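As an aside, not covered by either option above: if your sed is GNU sed, the -z (--null-data) option gives a shorter route to option #1's whole-file behaviour. It makes sed split input on NUL bytes instead of newlines, so a text file containing no NULs arrives in the pattern space in one piece and the :a;N;$!ba loop becomes unnecessary:

sed -z 's/\n|/|/g' input.txt

Memory-wise this behaves like option #1, since the whole file is still held in memory at once.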

On a ~500MB file, option #1 runs about twice as fast as option #2 (1.5s vs 3.4s):

$ du -h /tmp/foobar.txt
544M    /tmp/foobar.txt

$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real    0m1.564s
user    0m1.390s
sys     0m0.171s

$ time sed ':a;N;s/\n|/|/;ta;P;D' /tmp/foobar.txt > /dev/null
real    0m3.418s
user    0m3.239s
sys     0m0.163s

At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4197 11001 99 172427 558888   1 19:22 pts/10   00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 558188 (545M)

And option #2:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4401 11001 99  3468   864    3 19:22 pts/10   00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 236 (0M)
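The Pss figures in the notes above can be reproduced by summing the per-mapping Pss fields in /proc/{pid}/smaps while the sed process is still running (substitute the pid reported by ps):

$ awk '/^Pss:/ { kb += $2 } END { print kb " kB" }' /proc/{pid}/smaps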

In summary (w/ commentary):

  • if you have files of unknown size, streaming without buffering is a better decision.
  • if every second matters, then buffering the entire file and processing it at once may be fine -- but ymmv.
  • my personal experience with tuning shell scripts is that awk or perl (or tr, though it's the least portable) or even bash may be preferable to using sed (a perl sketch follows this list).
  • yet, sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
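For example, a slurp-mode perl equivalent of option #1 might look like this (just a sketch, not something benchmarked above; -0777 makes perl read the whole file at once, and the substitution is the same newline-plus-"|" join):

perl -0777 -pe 's/\n\|/|/g' input.txt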


Here is an awk solution:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf "\n"$0} END{print""}' data

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

Explanation:

Awk implicitly loops through every line in the file. The breakdown below goes rule by rule, with a short demonstration on a tiny sample input after the list.

  • substr($0,1,1)=="|"{printf $0;next}

    If this line begins with a vertical bar, then print it (without a final newline) and then skip to the next line. We are using printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them.

  • {printf "\n"$0}

    If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).

  • END{print""}

    At the end of the file, print a newline.
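To make that concrete, here is the program run on a tiny, made-up sample (the input lines are invented, following the same continuation-lines-begin-with-"|" layout as data); note the blank line at the top of the output:

$ printf 'A\n|B\nC\n|D\n' | awk 'substr($0,1,1)=="|"{printf $0;next} {printf "\n"$0} END{print""}'

A|B
C|D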

Refinement

The above prints an extra newline at the beginning of the output. If that is a problem, it can be eliminated with just a minor change:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf new $0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
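Run against the same tiny sample as before, the refined version produces no leading blank line:

$ printf 'A\n|B\nC\n|D\n' | awk 'substr($0,1,1)=="|"{printf $0;next} {printf new $0;new="\n"} END{print""}'
A|B
C|D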


This might work for you (GNU sed):

sed ':a;N;s/\n|/|/;ta;P;D' file

This processes the file a line at a time, as an alternative to @michael_n's solution, which slurps the file contents into memory before processing.
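For a quick illustration (the piped sample input is invented, following the same layout where continuation lines begin with "|"):

$ printf 'A\n|B\nC\n|D\n' | sed ':a;N;s/\n|/|/;ta;P;D'
A|B
C|D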