
How to split files up and process them in parallel and then stitch them back? (unix)


The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt

Here is the meaning of the arguments:

-a: read input from a file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the output in the same sequence as the input
--spreadstdin: send the above 1000-line blocks to the command's stdin
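
For this to work, dosomething just has to be an ordinary stdin-to-stdout filter: parallel hands it a block of lines on stdin and collects whatever it prints. A minimal sketch (the uc() call is only a stand-in for your real processing):

#!/usr/bin/perl
use strict;
use warnings;

# read the block of lines parallel feeds us on stdin,
# write the processed lines to stdout
while ( my $line = <STDIN> ) {
    chomp $line;
    print uc ( $line ), "\n";   # stand-in transformation
}

Because of -k, parallel reassembles the per-block output in the original order, so result.txt ends up in the same line order as infile.txt.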


I've never tried it myself, but GNU parallel might be worth checking out.

Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.

EXAMPLE: Processing a big file using more cores

       To process a big file or some output you can use --pipe to split up
       the data into blocks and pipe the blocks into the processing program.

       If the program is gzip -9 you can do:

       cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

       This will split bigfile into blocks of 1 MB and pass that to gzip -9
       in parallel. One gzip will be run per CPU core. The output of gzip -9
       will be kept in order and saved to bigfile.gz

Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.

You can find some introductory videos by the GNU Parallel author here.


Assuming your limiting factor is NOT your disk, you can do this in Perl with fork(), specifically with Parallel::ForkManager:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $max_forks = 8; #2x procs is usually optimal

sub process_line {
    #do something with this line
}

my $fork_manager = Parallel::ForkManager -> new ( $max_forks );

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager -> start and next;
    process_line ( $line );
    $fork_manager -> finish;
}
close ( $input );
$fork_manager -> wait_all_children();

The downside of doing something like this, though, is coalescing your output. The parallel tasks don't necessarily finish in the order they started, so you have all sorts of potential problems with serialising the results.

You can work around these with something like flock, but you need to be careful, as too much locking can eat away the parallel advantage you were after in the first place. (Hence my first statement - if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway.)
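
If you do go the flock route, the idea is simply that every child takes an exclusive lock before it appends its chunk to the shared output file. A rough sketch (result.txt and write_result are just illustrative names):

use Fcntl qw(:flock);

sub write_result {
    my ( $text ) = @_;
    # append mode, so concurrent writers never clobber each other's data
    open ( my $out, '>>', 'result.txt' ) or die $!;
    flock ( $out, LOCK_EX ) or die "flock: $!";   # exclusive, cross-process lock
    print {$out} $text;
    close ( $out );   # close flushes the buffer and releases the lock
}

Note that this only stops writes from interleaving; it does nothing to restore the original line order.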

There are various possible solutions, though - so much so that there's a whole chapter on it in the Perl docs: perlipc - but keep in mind that you can retrieve data with Parallel::ForkManager too.
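
For completeness, here's a sketch of that data-retrieval approach (infile.txt and the trivial process_line are placeholders): each child hands its result back via finish(), the parent collects it in a run_on_finish callback keyed by line number, and prints everything in order once the children are done.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

sub process_line {
    my ( $line ) = @_;
    return $line;   # placeholder - put your real processing here
}

my $max_forks = 8;
my $fork_manager = Parallel::ForkManager -> new ( $max_forks );

my %results;   # filled in by the parent, keyed by input line number

# run_on_finish fires in the parent for every child that exits;
# the last argument is whatever reference the child passed to finish()
$fork_manager -> run_on_finish ( sub {
    my ( $pid, $exit_code, $ident, $signal, $core, $data_ref ) = @_;
    $results{$ident} = $$data_ref if defined $data_ref;
} );

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    my $line_no = $.;                              # remember the input order
    $fork_manager -> start ( $line_no ) and next;  # $line_no becomes $ident
    my $processed = process_line ( $line );
    $fork_manager -> finish ( 0, \$processed );    # ship the result to the parent
}
close ( $input );
$fork_manager -> wait_all_children();

# sort numerically by line number to restore the original order
print $results{$_} for sort { $a <=> $b } keys %results;

Since the results come back keyed by line number, sorting the keys restores the original order without any file locking.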