Parallel execution of Unix command?


--pipe is inefficient (though not at the scale you are measuring - something is very wrong on your system). It can deliver in the order of 1 GB/s (total).

--pipepart is, on the contrary, highly efficient. It can deliver in the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed those blocks into a wc -l running on each core:

parallel --block -1 --pipepart -a data.txt1 wc -l
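Each wc -l prints the count for its own block only, so if you want a single total you can sum the per-block counts, for example with the same awk one-liner used in the timings below:

parallel --block -1 --pipepart -a data.txt1 wc -l | awk '{s+=$1} END {print s}'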

You need version 20161222 or later for --block -1 to work.
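If you are not sure which version is installed, a small sketch like this checks it and falls back to an explicit block size (the 100m fallback is just an arbitrary choice for illustration):

# check the first line of `parallel --version` ("GNU parallel YYYYMMDD")
ver=$(parallel --version | awk 'NR==1 {print $3}')
if [ "$ver" -ge 20161222 ]; then
    parallel --block -1 --pipepart -a data.txt1 wc -l
else
    # older versions: fall back to a fixed block size (assumption: 100m)
    parallel --block 100m --pipepart -a data.txt1 wc -l
fi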

These are timings from my old dual-core laptop. seq 200000000 generates 1.8 GB of data.

$ time seq 200000000 | LANG=C wc -c
1888888898

real    0m7.072s
user    0m3.612s
sys     0m2.444s

$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    1m28.101s
user    0m25.892s
sys     0m40.672s

The time here is mostly due to GNU Parallel spawning a new wc -c for each 1 MB block (around 1,800 processes for 1.8 GB of input). Increasing the block size makes it faster:

$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m26.269s
user    0m8.988s
sys     0m11.920s

$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m21.628s
user    0m7.636s
sys     0m9.516s

As mentioned, --pipepart is much faster if you have the data in a file:

$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m2.242s
user    0m0.424s
sys     0m2.880s

So on my old laptop I can process 1.8 GB in 2.2 seconds.

If you have only one core and your work is CPU bound, then parallelizing will not help you. Parallelizing on a single-core machine can make sense if most of the time is spent waiting (e.g. waiting for the network).
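As a minimal sketch of that waiting case: even on a single core, downloading many files concurrently can help, because each job spends most of its time blocked on the network (urls.txt and the job count of 8 are assumptions for illustration):

# fetch the URLs listed in urls.txt, 8 downloads at a time
parallel -j 8 curl -sO {} :::: urls.txt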

However, the timings from your computer tell me something is very wrong with it. I recommend you test your program on another computer.


In short, yes. You will need more physical cores on the machine to get a benefit from parallelization. Just to make sure I understand your task, the following is what you intend to do:

file1 is a 10,000,000 line file

split into 4 files >
file1.1  > processing > output1
file1.2  > processing > output2
file1.3  > processing > output3
file1.4  > processing > output4

>> cat output* > output

And you want to parallelize the middle part and run it on 4 cores simultaneously. Am I correct? I think you can use GNU parallel in a much better way: write the processing code for one of the files and then let parallel substitute the segment number (pseudocode warning)

parallel --jobs 4 "processing code on the file segments with sequence variable {}"  ::: 1 2 3 4 

Where --jobs (-j) is the number of jobs to run in parallel.
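As a concrete sketch of that pseudocode, assuming GNU coreutils split and a hypothetical per-segment script process_segment.sh that reads one segment and writes one output file:

# split file1 into 4 roughly equal pieces without breaking lines
# (produces file1.1 .. file1.4, matching the diagram above)
split -n l/4 -a 1 --numeric-suffixes=1 file1 file1.

# run the (hypothetical) processing script on the 4 segments, 4 jobs at once
parallel --jobs 4 './process_segment.sh file1.{} > output{}' ::: 1 2 3 4

# merge the partial results
cat output1 output2 output3 output4 > output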

UPDATE

Why are you using the parallel command for the sequential work inside each of file1.1, 1.2, 1.3 and 1.4? Keep that inner loop as the regular sequential processing you have already coded, and let parallel distribute the four segments:

parallel 'for i in $(seq 1 250000);do cat file1.{} >> output{}.txt;done' ::: 1 2 3 4 

The above code will run your 4 segmented files from csplit in parallel on 4 cores; each job is equivalent to one of the following sequential loops:

for i in $(seq 1 250000);do cat file1.1 >> output1.txt;done
for i in $(seq 1 250000);do cat file1.2 >> output2.txt;done
for i in $(seq 1 250000);do cat file1.3 >> output3.txt;done
for i in $(seq 1 250000);do cat file1.4 >> output4.txt;done

I am pretty sure that --pipepart, as suggested above by Ole, is the better way to do it, given that you have high-speed data access from the disk.