
Run 4 concurrent instances of a python script on a folder of data files


You can use the multiprocessing module. Assuming you have a list of files to process and a function to call for each file, you can simply use a worker pool like this:

from multiprocessing import Pool, cpu_count

pool = Pool(processes=cpu_count())
pool.map(process_function, file_list, chunksize=1)

If your process_function doesn't return anything useful, you can simply ignore the return value of map.
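For a folder of data files, a minimal self-contained sketch of this approach could look like the following; the body of process_function, the *.txt glob, and the .out naming are illustrative assumptions rather than part of the original answer:

import glob
from multiprocessing import Pool, cpu_count

def process_function(path):
    # Placeholder for the real per-file work (e.g. whatever
    # fastq_groom.py does for one input file).
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        dst.write(src.read())

if __name__ == "__main__":
    file_list = glob.glob("*.txt")      # assumed pattern for the data files
    with Pool(processes=4) as pool:     # 4 workers as asked; cpu_count() also works
        pool.map(process_function, file_list, chunksize=1)

Using processes=4 pins the pool at the four concurrent workers asked for in the question; cpu_count() would instead size it to the machine.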


Take a look at xargs. Its -P option offers a configurable degree of parallelism. Specifically, something like this should work for you:

ls files* | awk '{print $1,$1".out"}' | xargs -P 4 -n 2 python fastq_groom.py
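As a hedge against file names containing spaces (and the ls-parsing concern raised in the last answer below), a sketch of the same idea using GNU find and xargs instead would be:

find . -maxdepth 1 -name 'files*' -print0 | xargs -0 -P 4 -I{} python fastq_groom.py {} {}.out

With -I{}, xargs runs one command per file and substitutes {} wherever it appears, so {}.out becomes the matching output name; -P 4 still caps the number of concurrent processes at four.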


Give this a shot:

#!/bin/bash
files=( * )
for ((i=0; i<${#files[@]}; i+=4)); do
   python fastq_groom.py "${files[$i]}"   "${files[$i]}".out &
   python fastq_groom.py "${files[$i+1]}" "${files[$i+1]}".out &
   python fastq_groom.py "${files[$i+2]}" "${files[$i+2]}".out &
   python fastq_groom.py "${files[$i+3]}" "${files[$i+3]}".out &
   wait   # block until all four background jobs have finished
done

The script above puts all files into an array named files. It then launches and backgrounds four python processes on the first four files and waits for them. As soon as all four of those processes are complete, it launches the next four. It's not as efficient as always keeping a queue of four jobs running, but if all processes take around the same amount of time, it should be pretty close.
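If you do want to keep four jobs running at all times rather than working in batches of four, a minimal sketch along these lines would do it, assuming bash 4.3+ (which is when wait -n was added):

#!/bin/bash
files=( * )
for f in "${files[@]}"; do
    # Block while four jobs are already running; wait -n returns
    # as soon as any one of them exits.
    while (( $(jobs -rp | wc -l) >= 4 )); do
        wait -n
    done
    python fastq_groom.py "$f" "$f".out &
done
wait   # let the last few jobs finish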

Also, please, please, please don't use the output of ls like that. Just use standard globbing, as in for file in *.txt; do ...; done