Team, I have a working script, but since we upgraded some drivers on the GPU system it has started leaving zombie processes, and the system slows down because the script runs periodically as a cron job.
Is there a way to clean up gracefully after the script runs? What is the best procedure for cleaning up after yourself once a script is done running? What am I doing wrong, or what could be done better?
gpu-health-check.sh: |
  #!/bin/bash
  # Do nothing if nvidia-smi is found. This is not a GPU node.
  if ! [ -x "$(command -v nvidia-smi)" ]; then
    echo "nvidia-smi not found. Ignoring"
    exit 0
  fi
  # Check if there is any retired page. The query is copied from https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.
  bad_gpus=$(nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv,noheader | cut -d, -f1| sort | uniq)
  if [ -z "${bad_gpus}" ]; then
    echo "No Single/Double Bit ECC Error found"
    exit 0
  fi
  for bad_gpu in "{$bad_gpus}"; do
    # Exit 1 if there is a pending page blacklist as we need to reboot the node to actually add the page to the blacklist.
    nvidia-smi -i $bad_gpu -q -d PAGE_RETIREMENT | grep Pending| grep Yes > /dev/null 2>&1
    if [ $? -eq 0 ]; then
      echo "Found pending blacklist on ${bad_gpu}"
      exit 1
    fi
    echo "No pending blacklist on ${bad_gpu}"
  done
  exit 0
On the GPU system I see:
741:19286 ? Z 0:00 [gpu-health-chec] <defunct>
757:30022 ? Z 0:00 [gpu-health-chec] <defunct>
761:31930 ? Z 0:00 [gpu-health-chec] <defunct>
762:31931 ? Z 0:00 [gpu-health-chec] <defunct>
794:37947 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
795:37948 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
796:37955 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
803:37962 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
816:50066 ? Z 0:00 [gpu-health-chec] <defunct>
817:50067 ? Z 0:00 [gpu-health-chec] <defunct>
All of the above just keep piling up. How can I avoid this? The system becomes unresponsive after some hours.
1 Answer
#!/bin/bash
Why not plain /bin/sh? I don't see any non-POSIX shell constructs in there.
sort | uniq
We could replace with sort -u (that's a standard option).
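To illustrate the equivalence, here is a minimal sketch that runs both pipelines on a made-up list of GPU UUIDs (the sample values are placeholders, not real nvidia-smi output) and prints the deduplicated result:

```sh
# Stand-in for the first CSV column of the nvidia-smi query (made-up UUIDs).
sample='GPU-aaa
GPU-bbb
GPU-aaa'

# Original form: two processes.
with_uniq=$(printf '%s\n' "$sample" | sort | uniq)

# Suggested form: one process, same result.
with_u=$(printf '%s\n' "$sample" | sort -u)

printf '%s\n' "$with_u"
```

Both variables end up holding the same two lines; sort -u simply saves one process per run.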
echo "Found pending blacklist on ${bad_gpu}"
exit 1
That message should go to standard error stream: >&2. The same may be true of the other informational messages.
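One way to apply this consistently is a small helper that always writes to standard error. This is only a sketch: the name warn is hypothetical and does not appear in the original script, and the UUID is a made-up placeholder.

```sh
# Hypothetical helper (not part of the original script):
# print all arguments as one line on standard error.
warn() {
    printf '%s\n' "$*" >&2
}

bad_gpu='GPU-aaa'   # made-up UUID for illustration
warn "Found pending blacklist on ${bad_gpu}"
```

With this in place, redirecting standard output (for example to a log of normal results) no longer swallows the diagnostic.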
for bad_gpu in "{$bad_gpus}"; do
Really? "{$bad_gpus}" is a single token; I think you meant $bad_gpus there. Especially as we then expand $bad_gpu unquoted in the next line.
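The difference is easy to see by counting loop iterations. The sketch below assumes bad_gpus holds one UUID per line, as the sort output in the question would produce; the UUIDs themselves are made up.

```sh
bad_gpus='GPU-aaa
GPU-bbb'   # made-up UUIDs, one per line

# As written in the question: the quotes suppress word splitting and the
# braces are literal, so the loop sees one word: "{GPU-aaa<newline>GPU-bbb}".
broken=0
for bad_gpu in "{$bad_gpus}"; do
    broken=$((broken + 1))
done

# Unquoted expansion: the shell splits on whitespace, one word per UUID.
fixed=0
for bad_gpu in $bad_gpus; do
    fixed=$((fixed + 1))
done

echo "quoted-with-braces: $broken iteration(s), unquoted: $fixed iteration(s)"
```

The unquoted form relies on the UUIDs containing no whitespace or glob characters, which seems a safe assumption for this data but is worth keeping in mind.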
grep Pending| grep Yes > /dev/null 2>&1
If Pending and Yes always occur in the same order, we could simplify to a single command (and we don't need the redirection):
grep -q 'Pending.*Yes'
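A quick way to check how this pattern behaves is to feed it sample lines. In the sketch below the input lines are assumptions modeled on what the original grep pair expects, not captured nvidia-smi output, and has_pending is a hypothetical helper name.

```sh
# Hypothetical helper: succeed if any line on stdin contains
# "Pending" followed later on the same line by "Yes".
has_pending() {
    grep -q 'Pending.*Yes'
}

# Assumed sample line for illustration.
if printf '%s\n' 'Pending Page Blacklist : Yes' | has_pending; then
    echo "pending"
else
    echo "not pending"
fi
```

Unlike the two chained grep calls, this pattern is order-sensitive: a line with Yes before Pending would no longer match, which is why the caveat about ordering matters.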
if [ $? -eq 0 ]; then
That's an antipattern - it's a sign that you need to move the preceding statement into the if:
if nvidia-smi -i "$bad_gpu" ...
then
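Putting the points of this review together, the loop might look like the sketch below. It is not a drop-in replacement: the function names and the NVIDIA_SMI override variable are hypothetical additions made so the logic can be exercised without a GPU, while the nvidia-smi arguments are taken unchanged from the question.

```sh
# Assumed override hook for testing; defaults to the real tool.
: "${NVIDIA_SMI:=nvidia-smi}"

# Hypothetical wrapper: exit status 0 if the given GPU reports a
# pending page blacklist, non-zero otherwise.
has_pending_blacklist() {
    "$NVIDIA_SMI" -i "$1" -q -d PAGE_RETIREMENT | grep -q 'Pending.*Yes'
}

# Hypothetical wrapper: check every GPU passed as an argument and
# return 1 at the first one with a pending blacklist.
check_gpus() {
    for bad_gpu in "$@"; do
        # The pipeline's exit status is tested directly; no $? needed.
        if has_pending_blacklist "$bad_gpu"; then
            echo "Found pending blacklist on ${bad_gpu}" >&2
            return 1
        fi
        echo "No pending blacklist on ${bad_gpu}"
    done
    return 0
}
```

In the script this could be invoked as check_gpus $bad_gpus || exit 1, with the expansion deliberately unquoted so that each UUID becomes its own argument.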