I am using a Bash script to execute a Python script multiple times. In order to speed up the execution, I would like to execute these (independent) processes in parallel. The code below does so:
#!/usr/bin/env bash
script="path_to_python_script"
N=16 # number of processors
mkdir -p data/
for i in `seq 1 1 100`; do
for j in {1..100}; do
((q=q%N)); ((q++==0)) && wait
if [ -e data/file_$i-$j.txt ]
then
echo "data/file_$i-$j.txt exists"
else
($script -args_1 $i > data/file_$i-$j.txt ;
$script -args_1 $i -args_2 value -args_3 value >> data/file_$i-$j.txt) &
fi
done
done
However, I am wondering whether this code follows common best practices for parallelizing for loops in Bash. Are there ways to improve its efficiency?
- You can use GNU Parallel, it's very helpful to execute a number of tasks in a controlled way. – eckes, Jun 12, 2019
2 Answers
Some suggestions (a combined sketch applying several of them follows the list):
- The trailing slash in the `mkdir` command is redundant.
- `$(...)` is preferred over backticks for command substitution.
- Why use `seq` in only one of the loops? They both do the same loop, so you might as well use `{1..100}` in both places.
- Semicolons are unnecessary in the vast majority of cases. Simply use a newline to achieve the same separation between commands.
- Use More Quotes™.
- `set -o errexit -o noclobber -o nounset` at the start of the script will be helpful. It'll exit the script instead of overwriting any files, for example, so you can get rid of the inner `if` statement if it's OK that the script stops when the file exists.
- `[[` is preferred over `[`.
- The whole exercise is probably easier to achieve with some standard pattern like GNU Parallel. Currently the script starts `N` commands, then waits for all of them to finish before starting any more. Unless the processes take very similar time this is going to waste a lot of time waiting.
- `N` (or for example `processors` for readability) should be determined dynamically, using for example `nproc --all`, rather than hardcoded.
- If you're worried about speed you should probably not create a subshell for your two script commands. `{` and `}` will group commands without creating a subshell.
- For the same reason you probably want to do a single redirection, like `{ "$script" ... && "$script" ...; } > "data/file_${i}-${j}.txt"`.
- Since you're "only" counting to 10,000 you don't need to reset `q` every time. You can for example set `process_count=0` outside the outer loop and check the modulo in a readable way such as:

 if [[ "$((process_count % processors))" -eq 0 ]]
 then
 wait
 fi
- The inner code (from the line starting with `((q=q%N))`) should be indented one more time.
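To make those points concrete, a revised version of the loop could look roughly like the sketch below. This is an untested illustration combining several of the suggestions above (a dynamic processor count, `set -o`, brace grouping with a single redirection, and a `process_count` modulo check); the `-args_1`/`-args_2`/`-args_3` options are carried over from the question unchanged:
#!/usr/bin/env bash
set -o errexit -o noclobber -o nounset

script="path_to_python_script"
processors="$(nproc --all)" # determine the number of parallel jobs dynamically
process_count=0

mkdir -p data

for i in {1..100}
do
 for j in {1..100}
 do
 # Wait for the current batch once $processors jobs have been started.
 if [[ "$((process_count % processors))" -eq 0 ]]
 then
 wait
 fi
 ((++process_count))
 # Group both runs without a subshell and redirect once; with noclobber
 # the redirection fails instead of overwriting an existing file.
 {
 "$script" -args_1 "$i"
 "$script" -args_1 "$i" -args_2 value -args_3 value
 } > "data/file_${i}-${j}.txt" &
 done
done
wait # wait for the last batch to finish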
- `[` is more portable and more consistent than `[[`. So your preference is certainly arguable. – Toby Speight, Jun 12, 2019
- @TobySpeight 1) Writing truly portable scripts is an absolute nightmare and not a good idea in terms of maintainability. 2) OP asked about Bash specifically. 3) The accepted answer (above the one you linked to) has about eight times more votes, so I would say the community has spoken. – l0b0, Jun 12, 2019
Using GNU Parallel your code will look something like this:
#!/usr/bin/env bash
export script="path_to_python_script"

doit() {
 i="1ドル"
 j="2ドル"
 # Run both invocations for a given (i, j); GNU Parallel captures the output.
 "$script" -args_1 "$i"
 "$script" -args_1 "$i" -args_2 value -args_3 value
}
# Export the function so the shells started by GNU Parallel can call it.
export -f doit

# --results saves each job's output according to the data/file_{1}-{2}.txt template;
# --resume skips combinations whose results already exist.
parallel --resume --results data/file_{1}-{2}.txt doit ::: {1..100} ::: {1..100}
In your original code, if one job in a batch of 16 takes longer than the other 15, you will have 15 cores sitting idle waiting for the last one to finish. GNU Parallel uses the CPUs better because a new job is started as soon as one finishes.
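By default GNU Parallel starts one job per CPU core. If you want to override that limit or keep a record of what has run, the --jobs and --joblog options can be added; a possible invocation (my own sketch, not part of the answer above) would be:
parallel --jobs 16 --joblog data/joblog --resume --results data/file_{1}-{2}.txt doit ::: {1..100} ::: {1..100}
Here --joblog records each job's runtime and exit status, which --resume can also use to skip work that has already completed.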