\$\begingroup\$

Team, I have a working script, but since we upgraded some drivers on the GPU system it has started leaving zombie processes behind, and the system slows down because the script runs periodically as a cron job.

Is there a way to clean up gracefully after the script runs? What is the best procedure for a script to clean up after itself when it finishes? What am I doing wrong, and what could be done better?

gpu-health-check.sh: |
 #!/bin/bash
 # Do nothing if nvidia-smi is found. This is not a GPU node.
 if ! [ -x "$(command -v nvidia-smi)" ]; then
 echo "nvidia-smi not found. Ignoring"
 exit 0
 fi
 # Check if there is any retired page. The query is copied from https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.
 bad_gpus=$(nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv,noheader | cut -d, -f1| sort | uniq)
 if [ -z "${bad_gpus}" ]; then
 echo "No Single/Double Bit ECC Error found"
 exit 0
 fi
 for bad_gpu in "{$bad_gpus}"; do
 # Exit 1 if there is a pending page blacklist as we need to reboot the node to actually add the page to the blacklist.
 nvidia-smi -i $bad_gpu -q -d PAGE_RETIREMENT | grep Pending| grep Yes > /dev/null 2>&1
 if [ $? -eq 0 ]; then
 echo "Found pending blacklist on ${bad_gpu}"
 exit 1
 fi
 echo "No pending blacklist on ${bad_gpu}"
 done
 exit 0

On the GPU system I see:

741:19286 ? Z 0:00 [gpu-health-chec] <defunct>
757:30022 ? Z 0:00 [gpu-health-chec] <defunct>
761:31930 ? Z 0:00 [gpu-health-chec] <defunct>
762:31931 ? Z 0:00 [gpu-health-chec] <defunct>
794:37947 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
795:37948 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
796:37955 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
803:37962 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
816:50066 ? Z 0:00 [gpu-health-chec] <defunct>
817:50067 ? Z 0:00 [gpu-health-chec] <defunct>

All of the above just keep piling up. How can I avoid this? The system becomes unresponsive after a few hours.

asked Feb 26, 2020 at 16:13
\$\endgroup\$

1 Answer

\$\begingroup\$
 #!/bin/bash

Why not plain /bin/sh? I don't see any non-POSIX shell constructs in there.

sort | uniq

We could replace that with sort -u (a standard option).
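
A quick way to convince yourself the two forms are equivalent is sketched below. The GPU-aaa and GPU-bbb values are made-up placeholders standing in for the first CSV column of the nvidia-smi query, not real output.

```shell
# Hypothetical sample input; real input would come from nvidia-smi | cut -d, -f1.
sample='GPU-bbb
GPU-aaa
GPU-bbb'

# The original two-process form and the single-process form.
with_uniq=$(printf '%s\n' "$sample" | sort | uniq)
with_u=$(printf '%s\n' "$sample" | sort -u)

[ "$with_uniq" = "$with_u" ] && echo "same output"
```

Besides being shorter, sort -u saves one process per run, which is a small win for a script that runs on a schedule.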

 echo "Found pending blacklist on ${bad_gpu}"
 exit 1

That message should go to the standard error stream: >&2. The same may be true of the other informational messages.
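
As a minimal sketch of what that looks like, the helper below writes the message to standard error. The name report_pending is hypothetical and does not appear in the original script.

```shell
# report_pending is a hypothetical helper, introduced only for illustration.
report_pending() {
    # >&2 redirects this echo from stdout to stderr.
    echo "Found pending blacklist on $1" >&2
}

# Command substitution captures stdout only, so this stays empty;
# the message itself went to stderr (discarded here).
captured=$(report_pending GPU-aaa 2>/dev/null)
```

Keeping diagnostics on stderr means anything that consumes the script's stdout does not have to filter them out.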

 for bad_gpu in "{$bad_gpus}"; do

Really? "{$bad_gpus}" is a single token; I think you meant $bad_gpus there. Especially as we then expand $bad_gpu unquoted in the next line.
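
To illustrate the difference, here is a small sketch that counts loop iterations for both spellings. The two GPU identifiers are made-up placeholders, and the unquoted form assumes the values contain no whitespace or glob characters.

```shell
# Hypothetical list of two GPU identifiers, one per line.
bad_gpus='GPU-aaa
GPU-bbb'

# Quoted with stray braces: a single word, so the loop body runs once.
quoted_count=0
for bad_gpu in "{$bad_gpus}"; do
    quoted_count=$((quoted_count + 1))
done

# Unquoted: the shell splits on whitespace, one iteration per identifier.
unquoted_count=0
for bad_gpu in $bad_gpus; do
    unquoted_count=$((unquoted_count + 1))
done

echo "quoted: $quoted_count, unquoted: $unquoted_count"
# prints "quoted: 1, unquoted: 2"
```

In the quoted form the single value also carries the literal { and } characters and an embedded newline, so it is not a usable argument for nvidia-smi -i.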

grep Pending| grep Yes > /dev/null 2>&1

If Pending and Yes always occur in the same order, we could simplify to a single command (and we don't need the redirection):

grep -q 'Pending.*Yes'

 if [ $? -eq 0 ]; then

That's an antipattern - it's a sign that you need to move the preceding statement into the if:

if nvidia-smi -i "$bad_gpu" ...
then
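
Putting the last two points together, a sketch might look like the following. Here fake_query is a hypothetical stand-in for the nvidia-smi -i "$bad_gpu" -q -d PAGE_RETIREMENT call so the snippet runs without a GPU, and the sample text only approximates that command's output.

```shell
# fake_query is a hypothetical stand-in for the real nvidia-smi query;
# its output format is an assumption for illustration.
fake_query() {
    printf '%s\n' '    Retired Pages' "        Pending Page Blacklist      : $1"
}

# check_gpu is also a hypothetical name. The pipeline sits directly in the
# if condition, so there is no separate $? test, and grep -q needs no redirection.
check_gpu() {
    if fake_query "$1" | grep -q 'Pending.*Yes'; then
        echo "pending"
    else
        echo "clear"
    fi
}

check_gpu Yes
check_gpu No
```

The if statement tests the exit status of the last command in the pipeline, which is grep, so the behaviour matches the original $? check.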

answered Feb 26, 2020 at 17:19
\$\endgroup\$
