Run 4 concurrent instances of a python script on a folder of data files
You can use the multiprocessing module. I suppose you have a list of files to process and a function to call for each file. Then you could simply use a worker pool like this:
```python
from multiprocessing import Pool, cpu_count

pool = Pool(processes=cpu_count())
pool.map(process_function, file_list, chunksize=1)
```
If your `process_function` doesn't return a value, you can simply ignore the return value.
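For completeness, a self-contained sketch of the same idea; the `process_function` body and the file names are placeholders, not from the original answer, and the `if __name__ == "__main__"` guard matters on platforms that spawn rather than fork worker processes:

```python
from multiprocessing import Pool

def process_function(path):
    # Placeholder: the real per-file work (e.g. parsing a data file)
    # would go here.
    return path + ".out"

if __name__ == "__main__":
    file_list = ["a.fq", "b.fq", "c.fq"]  # hypothetical input files
    with Pool(processes=4) as pool:       # four concurrent workers
        results = pool.map(process_function, file_list, chunksize=1)
    print(results)
```

`chunksize=1` hands workers one file at a time, which keeps all four busy even when some files take much longer than others.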
Take a look at `xargs`. Its `-P` option offers a configurable degree of parallelism. Specifically, something like this should work for you:
```shell
ls files* | awk '{print $1,$1".out"}' | xargs -P 4 -n 2 python fastq_groom.py
```
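If the file names might contain whitespace, the same idea works without parsing `ls`. Here is a sketch using a NUL-delimited glob (GNU `xargs` assumed); the `.fq` names are hypothetical and `touch {}.out` stands in for the real `python fastq_groom.py {} {}.out` call:

```shell
# Work in a throwaway directory with dummy input files.
dir=$(mktemp -d)
cd "$dir"
touch a.fq b.fq "c d.fq"

# printf '%s\0' plus xargs -0 is safe for any file name; -P 4 runs up
# to four jobs at once, and -I {} substitutes each file name into the
# command line.
printf '%s\0' *.fq | xargs -0 -P 4 -I {} touch {}.out
```

Swap `touch {}.out` for the real command to process your files four at a time.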
Give this a shot:
```shell
#!/bin/bash
files=( * )
for ((i=0; i<${#files[@]}; i+=4)); do
    python fastq_groom.py "${files[$i]}"   "${files[$i]}".out &
    python fastq_groom.py "${files[$i+1]}" "${files[$i+1]}".out &
    python fastq_groom.py "${files[$i+2]}" "${files[$i+2]}".out &
    python fastq_groom.py "${files[$i+3]}" "${files[$i+3]}".out &
    wait   # block until all four background jobs have finished
done
```
This puts all files into an array named `files`. It then executes and backgrounds four python processes on the first four files; as soon as all four of those processes are complete, it executes the next four. It's not as efficient as always keeping a queue of 4 going, but if all processes take around the same amount of time it should be pretty close to that.
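That "queue of 4" can be kept full with bash's `wait -n` (bash 4.3 or later). A sketch under that assumption, with dummy `.fq` files and a `touch`-based stand-in for the real `python fastq_groom.py` call:

```shell
#!/bin/bash
# Work in a throwaway directory with dummy input files.
dir=$(mktemp -d)
cd "$dir"
touch a.fq b.fq c.fq d.fq e.fq f.fq

max_jobs=4
for f in *.fq; do
    # Stand-in for: python fastq_groom.py "$f" "$f.out" &
    ( sleep 0.1; touch "$f.out" ) &
    # Once four jobs are running, block until any one exits, so a new
    # job starts as soon as a slot frees up.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
done
wait   # wait for the last few jobs
```

Unlike the batch-of-four loop, this never leaves slots idle while one slow file holds up its batch.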
Also, please, please, please don't use the output of `ls` like that. Just use standard globbing, as in `for files in *.txt; do ...; done`.