
How to split files up and process them in parallel and then stitch them back? (unix)


The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.

You can use GNU parallel (apt-get install parallel on Debian).

So your problem can be solved using the following command:

parallel -a infile.txt -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt

Here is the meaning of the arguments:

-a: read input from a file instead of stdin
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the output in the same sequence as the input
--spreadstdin: send the above 1000-line blocks to the command's stdin
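
For this to work, dosomething just has to be an ordinary stdin-to-stdout filter: parallel hands it a block of lines on stdin and collects whatever it prints. A minimal sketch (the uc() call is only a stand-in for your real processing):

#!/usr/bin/perl
use strict;
use warnings;

# read the block of lines parallel feeds us on stdin,
# write the processed lines to stdout
while ( my $line = <STDIN> ) {
    chomp $line;
    print uc ( $line ), "\n";   # stand-in transformation
}

Because of -k, parallel reassembles the per-block output in the original order, so result.txt ends up in the same line order as infile.txt.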


I've never tried it myself, but GNU parallel might be worth checking out.

Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.

EXAMPLE: Processing a big file using more cores

       To process a big file or some output you can use --pipe to split up
       the data into blocks and pipe the blocks into the processing program.

       If the program is gzip -9 you can do:

       cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

       This will split bigfile into blocks of 1 MB and pass that to gzip -9
       in parallel. One gzip will be run per CPU core. The output of gzip -9
       will be kept in order and saved to bigfile.gz

Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.

You can find some introductory videos by the GNU Parallel author here.


Assuming your limiting factor is NOT your disk, you can do this in Perl with fork(), specifically with Parallel::ForkManager:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $max_forks = 8; #2x procs is usually optimal

sub process_line {
    #do something with this line
}

my $fork_manager = Parallel::ForkManager -> new ( $max_forks );

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager -> start and next;
    process_line ( $line );
    $fork_manager -> finish;
}
close ( $input );
$fork_manager -> wait_all_children();

The downside of doing something like this, though, is coalescing your output. The parallel tasks don't necessarily finish in the order they started, so you have all sorts of potential problems with serialising the results.

You can work around these with something like flock, but you need to be careful, as too much locking can eat away the parallel advantage you were after in the first place. (Hence my first statement - if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway.)
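
If you do go the flock route, the idea is simply that every child takes an exclusive lock before it appends its chunk to the shared output file. A rough sketch (result.txt and write_result are just illustrative names):

use Fcntl qw(:flock);

sub write_result {
    my ( $text ) = @_;
    # append mode, so concurrent writers never clobber each other's data
    open ( my $out, '>>', 'result.txt' ) or die $!;
    flock ( $out, LOCK_EX ) or die "flock: $!";   # exclusive, cross-process lock
    print {$out} $text;
    close ( $out );   # close flushes the buffer and releases the lock
}

Note that this only stops writes from interleaving; it does nothing to restore the original line order.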

There are various possible solutions, though - so much so that there's a whole chapter on it in the Perl docs: perlipc - but keep in mind that you can retrieve data with Parallel::ForkManager too.
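
For completeness, here's a sketch of that data-retrieval approach (infile.txt and the trivial process_line are placeholders): each child hands its result back via finish(), the parent collects it in a run_on_finish callback keyed by line number, and prints everything in order once the children are done.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

sub process_line {
    my ( $line ) = @_;
    return $line;   # placeholder - put your real processing here
}

my $max_forks = 8;
my $fork_manager = Parallel::ForkManager -> new ( $max_forks );

my %results;   # filled in by the parent, keyed by input line number

# run_on_finish fires in the parent for every child that exits;
# the last argument is whatever reference the child passed to finish()
$fork_manager -> run_on_finish ( sub {
    my ( $pid, $exit_code, $ident, $signal, $core, $data_ref ) = @_;
    $results{$ident} = $$data_ref if defined $data_ref;
} );

open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    my $line_no = $.;                              # remember the input order
    $fork_manager -> start ( $line_no ) and next;  # $line_no becomes $ident
    my $processed = process_line ( $line );
    $fork_manager -> finish ( 0, \$processed );    # ship the result to the parent
}
close ( $input );
$fork_manager -> wait_all_children();

# sort numerically by line number to restore the original order
print $results{$_} for sort { $a <=> $b } keys %results;

Since the results come back keyed by line number, sorting the keys restores the original order without any file locking.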