Parallel execution of Unix command?
--pipe
is inefficient (though not at the scale you are measuring - something is very wrong on your system). It can deliver on the order of 1 GB/s (total).
--pipepart
is, on the contrary, highly efficient. It can deliver on the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed those blocks into a wc -l running on each core:
parallel --block -1 --pipepart -a data.txt1 wc -l
You need version 20161222 or later for --block -1 to work.
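If you are unsure which version you have installed, the first line of the --version output contains the release number:

parallel --version | head -n 1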
These are timings from my old dual core laptop. seq 200000000
generates 1.8 GB of data.
$ time seq 200000000 | LANG=C wc -c
1888888898

real    0m7.072s
user    0m3.612s
sys     0m2.444s

$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    1m28.101s
user    0m25.892s
sys     0m40.672s
The time here is mostly due to GNU Parallel spawning a new wc -c
for each 1 MB block. Increasing the block size makes it faster:
$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m26.269s
user    0m8.988s
sys     0m11.920s

$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m21.628s
user    0m7.636s
sys     0m9.516s
As mentioned, --pipepart is much faster if you have the data in a file:
$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898

real    0m2.242s
user    0m0.424s
sys     0m2.880s
So on my old laptop I can process 1.8 GB in 2.2 seconds.
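The same approach applies to the original line-counting task; assuming data.txt1 is the file from your question, the per-block line counts can be summed the same way:

parallel --block -1 --pipepart -a data.txt1 wc -l | awk '{s+=$1} END {print s}'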
If you have only one core and your work is CPU-bound, then parallelizing will not help you. Parallelizing on a single-core machine can make sense if most of the time is spent waiting (e.g. waiting for the network).
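As a hypothetical illustration of such a waiting-bound case, downloading a list of URLs is mostly waiting on the network, so running several transfers at once can help even on one core (urls.txt here is just an assumed file with one URL per line):

# assumption: urls.txt contains one URL per line
parallel -j 8 -a urls.txt wget -q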
However, the timings from your computer tell me something is very wrong with it. I recommend you test your program on another computer.
In short, yes. You will need more physical cores on the machine to benefit from parallelization. Just to make sure I understand your task, the following is what you intend to do:
file1 is a 10,000,000 line file
split into 4 files >
file1.1 > processing > output1
file1.2 > processing > output2
file1.3 > processing > output3
file1.4 > processing > output4
>> cat output* > output
And you want to parallelize the middle part and run it on 4 cores (hopefully) simultaneously. Am I correct? I think you can use GNU parallel in a much better way: write the code for one of the files and use that command with a sequence variable (pseudocode warning):
parallel --jobs 4 "processing code on the file segments with sequence variable {}" ::: 1 2 3 4
Where --jobs (or -j) sets the number of jobs to run in parallel.
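As a more concrete sketch of that idea (process.sh is a hypothetical script containing your per-segment processing; the names follow your file1.1 .. file1.4 segments):

# hypothetical: process.sh does the processing for one segment and writes to stdout
parallel --jobs 4 './process.sh file1.{} > output{}' ::: 1 2 3 4
cat output1 output2 output3 output4 > output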
UPDATE

Why are you using the parallel command for sequential execution within your file1.1, 1.2, 1.3 and 1.4? Keep that inner loop as the regular sequential processing you have already coded, and parallelize across the four segments instead:
parallel 'for i in $(seq 1 250000);do cat file1.{} >> output{}.txt;done' ::: 1 2 3 4
The above code will run your 4 segmented files from csplit in parallel on 4 cores; it is equivalent to running these four loops simultaneously:
for i in $(seq 1 250000);do cat file1.1 >> output1.txt;done
for i in $(seq 1 250000);do cat file1.2 >> output2.txt;done
for i in $(seq 1 250000);do cat file1.3 >> output3.txt;done
for i in $(seq 1 250000);do cat file1.4 >> output4.txt;done
I am pretty sure that --pipepart, as suggested above by Ole, is the better way to do it, given that you have high-speed data access from the HDD.
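A minimal sketch of that approach, assuming the same hypothetical process.sh reads from stdin and writes its results to stdout, so no manual splitting is needed (-k keeps the output in input order):

# hypothetical: process.sh reads stdin and writes results to stdout
parallel -k --pipepart -a file1 --block -1 ./process.sh > output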