I am using a Bash script to execute a Python script multiple times. In order to speed up the execution, I would like to execute these (independent) processes in parallel. The code below does so:
#!/usr/bin/env bash
script="path_to_python_script"
N=16 # number of processors
mkdir -p data/
for i in `seq 1 1 100`; do
for j in {1..100}; do
((q=q%N)); ((q++==0)) && wait
if [ -e data/file_$i-$j.txt ]
then
echo "data/file_$i-$j.txt exists"
else
($script -args_1 $i > data/file_$i-$j.txt ;
$script -args_1 $i -args_2 value -args_3 value >> data/file_$i-$j.txt) &
fi
done
done
However, I am wondering whether this code follows common best practices for parallelizing for loops in Bash. Are there ways to improve its efficiency?
- You can use GNU Parallel, it's very helpful to execute a number of tasks in a controlled way. – eckes, Jun 12, 2019
2 Answers
Some suggestions (a combined sketch applying several of them follows the list):
- The trailing slash in the `mkdir` command is redundant.
- `$(...)` is preferred over backticks for command substitution.
- Why use `seq` in only one of the loops? They both do the same loop, so you might as well use `{1..100}` in both places.
- Semicolons are unnecessary in the vast majority of cases. Simply use a newline to achieve the same separation between commands.
- Use More Quotes™.
- `set -o errexit -o noclobber -o nounset` at the start of the script will be helpful. It'll exit the script instead of overwriting any files, for example, so you can get rid of the inner `if` statement if it's OK that the script stops when the file exists.
- `[[` is preferred over `[`.
- The whole exercise is probably easier to achieve with some standard pattern like GNU Parallel. Currently the script starts `N` commands, then waits for all of them to finish before starting any more. Unless the processes take very similar time this is going to waste a lot of time waiting.
- `N` (or for example `processors` for readability) should be determined dynamically, using for example `nproc --all`, rather than hardcoded.
- If you're worried about speed you should probably not create a subshell for your two script commands. `{` and `}` will group commands without creating a subshell.
- For the same reason you probably want to do a single redirection, like `{ "$script" ... && "$script" ...; } > "data/file_${i}-${j}.txt"`.
- Since you're "only" counting to 10,000 you don't need to reset `q` every time. You can for example set `process_count=0` outside the outer loop and check the modulo in a readable way such as:

 if [[ "$((process_count % processors))" -eq 0 ]]
 then
 wait
 fi
- The inner code (from the line starting with `((q=q%N))`) should be indented one more time.
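To make those points concrete, a revised version of the loop could look roughly like the sketch below. This is an untested illustration combining several of the suggestions above (a dynamic processor count, `set -o`, brace grouping with a single redirection, and a `process_count` modulo check); the `-args_1`/`-args_2`/`-args_3` options are carried over from the question unchanged:
#!/usr/bin/env bash
set -o errexit -o noclobber -o nounset

script="path_to_python_script"
processors="$(nproc --all)" # determine the number of parallel jobs dynamically
process_count=0

mkdir -p data

for i in {1..100}
do
 for j in {1..100}
 do
 # Wait for the current batch once $processors jobs have been started.
 if [[ "$((process_count % processors))" -eq 0 ]]
 then
 wait
 fi
 ((++process_count))
 # Group both runs without a subshell and redirect once; with noclobber
 # the redirection fails instead of overwriting an existing file.
 {
 "$script" -args_1 "$i"
 "$script" -args_1 "$i" -args_2 value -args_3 value
 } > "data/file_${i}-${j}.txt" &
 done
done
wait # wait for the last batch to finish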
- `[` is more portable and more consistent than `[[`. So your preference is certainly arguable. – Toby Speight, Jun 12, 2019
- @TobySpeight 1) Writing truly portable scripts is an absolute nightmare and not a good idea in terms of maintainability. 2) OP asked about Bash specifically. 3) The accepted answer (above the one you linked to) has about eight times more votes, so I would say the community has spoken. – l0b0, Jun 12, 2019
Using GNU Parallel your code will look something like this:
#!/usr/bin/env bash
export script="path_to_python_script"

doit() {
 i="1ドル"
 j="2ドル"
 # Run both invocations for a given (i, j); GNU Parallel captures the output.
 "$script" -args_1 "$i"
 "$script" -args_1 "$i" -args_2 value -args_3 value
}
# Export the function so the shells started by GNU Parallel can call it.
export -f doit

# --results saves each job's output according to the data/file_{1}-{2}.txt template;
# --resume skips combinations whose results already exist.
parallel --resume --results data/file_{1}-{2}.txt doit ::: {1..100} ::: {1..100}
In your original code, if one job in a batch of 16 takes longer than the other 15, you will have 15 cores sitting idle waiting for the last one to finish. GNU Parallel uses the CPUs better because a new job is started as soon as one finishes.
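By default GNU Parallel starts one job per CPU core. If you want to override that limit or keep a record of what has run, the --jobs and --joblog options can be added; a possible invocation (my own sketch, not part of the answer above) would be:
parallel --jobs 16 --joblog data/joblog --resume --results data/file_{1}-{2}.txt doit ::: {1..100} ::: {1..100}
Here --joblog records each job's runtime and exit status, which --resume can also use to skip work that has already completed.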