
How to control parallel tasks in Linux to avoid too much context switch


I wouldn't try to reinvent the load-balancing wheel by splitting the files. Use GNU parallel to manage tasks of different sizes. It has plenty of options for parallel execution on one or several machines. If you set it up to allow, say, 4 processes in parallel, it will do that, starting a new task whenever a shorter one completes.

https://www.gnu.org/software/parallel/

https://www.gnu.org/software/parallel/parallel_tutorial.html

Here's a simple example using cat as a stand-in for ./program:

    ... write a couple of files
    % cat > a
    a
    b
    c
    % cat > b
    a
    b
    c
    d
    % cat > files
    a
    b
    ... run the tasks
    % parallel cat {1} \> {1}.log < files
    % more b.log
    a
    b
    c
    d
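Applied to the question's setup, the whole job-queue logic collapses to one line (a sketch, assuming parameter_file lists one input file per line and that 6 parallel jobs are wanted; -j sets the job limit and :::: reads the arguments from a file):

    parallel -j6 './program_a {} > {}.log 2>&1' :::: parameter_file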


Since you are allowed to split files, I assume you are also allowed to combine them. In that case you could consider a fast preprocessing step as follows:

    #! /bin/bash
    # set the number of parallel processes
    CPU=6

    # combine all files into one
    rm -f complete.out
    while read parameter
    do
        cat "$parameter" >> complete.out
    done < parameter_file

    # count the number of lines
    lines=$(wc -l complete.out | cut -d " " -f 1)
    lines_per_file=$(( $lines / $CPU + 1 ))

    # split the big file into equal pieces named xa*
    rm -f xa*
    split --lines $lines_per_file complete.out

    # create a parameter file to mimic the old calling behaviour
    rm -f new_parameter_file
    for splinter in xa*
    do
        echo $splinter >> new_parameter_file
    done

    # this is the old call with just 'parameter_file' replaced by 'new_parameter_file'
    while read parameter
    do
        ./program_a "$parameter" > "$parameter.log" 2>&1 &
    done < new_parameter_file

Notes:

  • The file name pattern xa* of the generated files may be different in your setup.
  • Make sure that the last line of each file actually ends with a newline; a read loop silently drops an unterminated last line!
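To verify that the split lost nothing, a quick sanity check (a sketch; the total that wc prints should be exactly twice the line count of complete.out, since the pieces together must contain every line of the combined file):

    # per-file counts plus a total line
    wc -l complete.out xa*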


I also think I can use wait to achieve the goal.

Indeed, you can achieve the goal with wait, even though bash's wait unfortunately waits for every process of a specified set rather than for just any one (that is, we can't simply instruct bash to wait for whichever of the running processes finishes first; newer bash versions do offer wait -n for this, see the note after the code below). But since

The processing time for each task is almost linearly dependent on the number of lines

and

I want to split each file into 1k lines

we can, to a good approximation, say that the process started first also finishes first.

I assume you have already implemented the splitting of the files into 1000-line pieces (a possible sketch follows below) and that their names are stored in the variable $files, in your example File_A000 File_B000 … File_B009 File_C000 … File_C999.
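For completeness, one way to do that splitting step (a sketch, assuming GNU split and input files named File_A, File_B, File_C as in the question; --numeric-suffixes with --suffix-length 3 produces the 000-style names):

    # File_A -> File_A000, File_A001, ... with 1000 lines each
    for f in File_A File_B File_C
    do
        split --numeric-suffixes --suffix-length 3 --lines 1000 "$f" "$f"
    done
    files=$(echo File_[A-C][0-9][0-9][0-9])   # collect the generated names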

    set --                                  # tasks stored in $1..$6
    for file in $files
    do
        [ $# -lt 6 ] || { wait $1; shift; } # wait for and remove the oldest task once 6 are running
        ./program_a "$file" > "$file.log" 2>&1 &
        set -- "$@" $!                      # store the new task's PID last
    done
    wait                                    # wait for the final tasks to finish
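As an aside, if your bash is 4.3 or newer you don't need the first-started-finishes-first approximation at all: wait -n returns as soon as any one job finishes. A minimal sketch assuming that bash version and the same $files variable:

    for file in $files
    do
        while [ "$(jobs -rp | wc -l)" -ge 6 ]
        do
            wait -n                         # returns as soon as any one job exits
        done
        ./program_a "$file" > "$file.log" 2>&1 &
    done
    wait                                    # wait for the remaining jobs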