split file on Nth occurrence of delimiter


Using awk, you could do this:

awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt 

Update:

To not include the delimiter, try this:

awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt 

The next keyword causes awk to stop processing rules for this record and advance to the next one (the next line). I also changed >> to >, since if you run it more than once you probably don't want to append to the old chunk files.
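To see what the updated command does, here is a small self-contained run. It is only a sketch: the chunk size is passed in as an awk variable n (my addition, set to 2 here instead of 50000), and the directory awk_split_demo and the sample records are made up for the demonstration.

```shell
# Scratch directory for the demo (the name is arbitrary).
rm -rf awk_split_demo
mkdir awk_split_demo

# Sample input: four records, each terminated by a line holding only "+".
printf '%s\n' 'entry 1' '+' 'entry 2' 'more' '+' 'entry 3' '+' 'entry 4' '+' \
    > awk_split_demo/input.txt

# Same logic as the updated command, with the chunk size passed in as n.
# Every n-th delimiter line is skipped; the other delimiters stay in their chunk.
(
    cd awk_split_demo &&
    awk -v n=2 '
        /^\+$/ { if (++delim % n == 0) next }
        { file = sprintf("chunk%s.txt", int(delim / n)); print > file }
    ' < input.txt
)
```

With this input, chunk0.txt ends up holding entry 1, a + line, entry 2 and more, while chunk1.txt holds entry 3, a + line and entry 4; the second and fourth + lines are dropped.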


It isn't very hard to do in Perl if you can't find a suitable alternative (and it will perform pretty well):

#!/usr/bin/env perl
use strict;
use warnings;

# Configuration items - could be set by argument handling
my $prefix = "rs.";     # File prefix
my $number = 1;         # First file number
my $width  = 4;         # Number of digits to use in file name
my $rx     = qr/^\+$/;  # Match regex
my $limit  = 3;         # 50,000 in real case
my $quiet  = 0;         # Set to 1 to suppress file names

sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter

while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;

That's far from being a one-liner; I'm not sure whether that's a merit or not. The items that should be configured are grouped together, and could be set via command-line options, for example. You could end up with an empty file; you could spot that and remove it if necessary. You'd need a second counter; the existing one is a 'match counter' but you'd also need a line counter, and if the line counter was zero at the end you'd remove the last file. You'd also need the name to be able to remove it... fiddly, but not difficult.
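The empty-file clean-up could also be done after the fact from the shell, rather than with extra counters inside the script. A minimal sketch, assuming the chunks use the rs. prefix from the script; the directory empty_chunk_demo and the stand-in files are invented for the demonstration:

```shell
# Scratch directory with stand-in chunk files; the last one is empty,
# as happens when the input ends exactly on a chunk boundary.
rm -rf empty_chunk_demo
mkdir empty_chunk_demo
printf 'entry 1\n+\n' > empty_chunk_demo/rs.0001
printf 'entry 2\n+\n' > empty_chunk_demo/rs.0002
: > empty_chunk_demo/rs.0003

# Remove every chunk of zero size ([ -s FILE ] is true only for non-empty files).
for f in empty_chunk_demo/rs.*; do
    [ -s "$f" ] || rm -- "$f"
done

ls empty_chunk_demo
```

This trades the second counter for one extra pass over the file names, which is cheap next to splitting the data itself.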

Given the input (basically two copies of your sample data), the output from repsplit.pl (repeat split) was as shown:

$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$


Using Perl with + as the input record separator, in a concise "one-liner":

If you'd like to write $_ to newprefix.part.$c, as stated in your comment:

$ limit=50000 perl -053 -Mautodie -lne '
    BEGIN{$\=""}
    $count++;
    if ($count >= $ENV{limit}) {
        open my $fh, ">", "newprefix.part.$c";
        print $fh $_;
        close $fh;
    }
' file.txt
$ ls -l newprefix.part.*
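The -053 switch is what makes + the separator: perl's -0 option takes the input record separator as an octal character code, and octal 053 is the code for +. A quick check of that mapping from the shell, using only printf (this is a side demonstration, not part of the one-liner):

```shell
# Octal 053 is decimal 43, the ASCII code of "+".
printf '\053\n'       # prints the character with octal code 053
printf '%d\n' "'+"    # prints the numeric code of the character "+"
```

The same approach works for other single-character delimiters: look up the character's octal code and pass it after -0.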
