How can I merge multiple lines to create exactly two records based on field separators?
The 'sed' approach:
sed ':a;N;$!ba;s/\n|/|/g' input.txt
That said, awk would be faster and easier to understand and maintain. I just had this example handy (it is a common sed idiom for joining lines by removing newlines).
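To make the effect of the sed command concrete, here is a minimal sketch on made-up input. The sample records (`a1|b1` and so on) are hypothetical and are not the asker's data; the script is the same as above, only split into `-e` pieces, on the assumption that some non-GNU seds dislike a `;` directly after a label.

```shell
# Hypothetical sample: two records, each broken across three lines,
# where every continuation line starts with "|".
joined=$(
  printf '%s\n' 'a1|b1' '|c1|d1' '|e1' 'a2|b2' '|c2' '|d2|e2' |
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n|/|/g'
)

# Each record is now on a single line.
printf '%s\n' "$joined"
```

Only a newline that is immediately followed by "|" is removed, so the newline between the two records survives.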
EDIT:
To clarify the difference between this answer (option #1) and the alternative solution by @potong, which I'll call option #2 and which I actually prefer:
sed ':a;N;s/\n|/|/;ta;P;D' file
- note that these are two of many possible options with sed. I actually prefer non-sed solutions, since in general they run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 works entirely in memory, and option #2 works as a stream. (Note: when I say "buffer" below, technically I mean "pattern space".)
- option #1 reads the whole file into memory:
  - :a is just a label; N says append the next line to the buffer; if end-of-file ($) is not (!) reached, then branch (b) back to label :a ...
  - then, after the whole file is read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (a newline followed by "|") with just a "|", across the entire buffer (g)
- option #2 processes just a couple of lines at a time:
  - reads and appends the next line (N) into the buffer, then processes it (s/\n|/|/); branches (t) back to label :a only if the substitution was successful; otherwise prints (P) and then deletes (D) the buffer up to the first embedded newline, restarting the script with what remains ... and the stream continues.
- option #1 takes a lot more memory to run: in general, about as much as the size of your file. Option #2 requires minimal memory, so little that I didn't bother to see what it correlates to (I'm guessing the length of a line).
- option #1 runs faster: roughly twice as fast as option #2 in my test, though obviously it depends on the file and on what is being done.
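The two walkthroughs above can be tried side by side. The following is a sketch under stated assumptions: the sample input and the `sample_input` helper are hypothetical, and the behavior on the last record relies on GNU sed printing the pattern space when N finds no more input (other seds may differ).

```shell
# Hypothetical helper that emits two records split across several lines.
sample_input() {
  printf '%s\n' 'a1|b1' '|c1|d1' '|e1' 'a2|b2' '|c2' '|d2|e2'
}

# Option #1: loop on N until the last line, so the whole input sits in
# the pattern space, then substitute once with the g flag.
slurp=$(sample_input | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n|/|/g')

# Option #2: the pattern space holds the record built so far plus one
# look-ahead line. If the look-ahead starts with "|", it is joined and
# the loop repeats (t); otherwise the finished record is printed (P)
# and dropped (D), leaving the look-ahead line as the start of the next.
stream=$(sample_input | sed -e ':a' -e 'N' -e 's/\n|/|/' -e 'ta' -e 'P' -e 'D')

printf 'option #1:\n%s\n' "$slurp"
printf 'option #2:\n%s\n' "$stream"
```

On this input both variables should hold the same two joined records; the difference is only in how much of the file each script keeps in the pattern space at once.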
On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):
$ du -h /tmp/foobar.txt
544M /tmp/foobar.txt
$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real 0m1.564s
user 0m1.390s
sys 0m0.171s
$ time sed ':a;N;s/\n|/|/;ta;P;D' /tmp/foobar.txt > /dev/null
real 0m3.418s
user 0m3.239s
sys 0m0.163s
At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:
$ ps -F -C sed
UID        PID  PPID  C     SZ    RSS PSR STIME TTY       TIME     CMD
username  4197 11001 99 172427 558888   1 19:22 pts/10    00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 558188 (545M)
And option #2:
$ ps -F -C sed
UID        PID  PPID  C     SZ    RSS PSR STIME TTY       TIME     CMD
username  4401 11001 99   3468    864   3 19:22 pts/10    00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 236 (0M)
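For anyone who wants to repeat the comparison, here is a rough harness. It is only a sketch: the scratch files, the record layout, and the record count are assumptions of mine, not the /tmp/foobar.txt used above, and it checks that both options agree rather than reproducing the timings.

```shell
# Hypothetical scratch files (the benchmark above used /tmp/foobar.txt).
big=$(mktemp)
out1=$(mktemp)
out2=$(mktemp)

# 20000 made-up records of three lines each; raise the count to get
# closer to the ~500MB file used in the original measurement.
awk 'BEGIN { for (i = 1; i <= 20000; i++) printf "id%d|a\n|b|c\n|d\n", i }' > "$big"

# Option #1 (whole file in the pattern space) and option #2 (streaming).
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n|/|/g' "$big" > "$out1"
sed -e ':a' -e 'N' -e 's/\n|/|/' -e 'ta' -e 'P' -e 'D' "$big" > "$out2"

# The outputs should match byte for byte before any timing is compared.
if cmp -s "$out1" "$out2"; then same=yes; else same=no; fi
records=$(wc -l < "$out1" | tr -d ' ')
echo "identical=$same records=$records"

rm -f "$big" "$out1" "$out2"
```

To measure, prefix each sed line with `time` and watch the process with `ps` from a second terminal, as in the transcripts above; the numbers will depend on the machine and the sed build.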
In summary (w/ commentary),
- if you have files of unknown size, streaming without buffering is a better decision.
- if every second matters, then buffering the entire file and processing it at once may be fine -- but ymmv.
- my personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable) or even bash may be preferable to using sed.
- yet, sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
Here is an awk solution:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "\n%s",$0} END{print""}' data

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
Explanation:
Awk implicitly loops through every line in the file.
substr($0,1,1)=="|"{printf "%s",$0;next}
If this line begins with a vertical bar, then print it (without a final newline) and skip to the next line. We are using printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them. The line is passed as an argument to the "%s" format rather than used as the format string itself, so any % characters in the data are printed literally.
{printf "\n%s",$0}
If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).
END{print""}
At the end of the file, print a newline.
Refinement
The above prints out an extra newline at the beginning of the output. If that is a problem, it can be eliminated with just a minor change:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "%s%s",new,$0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
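As a small sketch of the refined awk approach, the function below wraps it for reuse. The name `join_records`, the `sep` variable, and the sample input are my own and purely illustrative. One adjustment worth calling out: each line is passed through a "%s" format instead of being used as the format string, so that a % in the data is not interpreted by printf.

```shell
# Hypothetical wrapper: joins continuation lines (those starting with "|")
# onto the preceding line. Reads the named files, or stdin if none given.
join_records() {
  awk 'substr($0, 1, 1) == "|" { printf "%s", $0; next }   # continuation
       { printf "%s%s", sep, $0; sep = "\n" }              # new record
       END { print "" }                                    # final newline
      ' "$@"
}

# Made-up input that deliberately contains % characters.
result=$(printf '%s\n' '100%|a' '|b' '200|c' '|d%s' | join_records)
printf '%s\n' "$result"
```

Because `sep` is empty until the first record has been printed, no blank line appears at the top of the output.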