Clean up after shell script that is run via k8s pod on a gpu node for health check

Question 1

Team, I have a working script, but since we upgraded some drivers on the GPU system it has started leaving zombie processes behind, and the system slows down because the script runs periodically as a cron job.

Is there a way to clean up gracefully after the script runs? What is the best procedure for a script to clean up after itself once it is done? What am I doing wrong, or what could be done better?

gpu-health-check.sh: |
 #!/bin/bash
 # Do nothing if nvidia-smi is not found. This is not a GPU node.
 if ! [ -x "$(command -v nvidia-smi)" ]; then
 echo "nvidia-smi not found. Ignoring"
 exit 0
 fi
 # Check if there is any retired page. The query is copied from https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html.
 bad_gpus=$(nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv,noheader | cut -d, -f1| sort | uniq)
 if [ -z "${bad_gpus}" ]; then
 echo "No Single/Double Bit ECC Error found"
 exit 0
 fi
 for bad_gpu in "{$bad_gpus}"; do
 # Exit 1 if there is a pending page blacklist as we need to reboot the node to actually add the page to the blacklist.
 nvidia-smi -i $bad_gpu -q -d PAGE_RETIREMENT | grep Pending| grep Yes > /dev/null 2>&1
 if [ $? -eq 0 ]; then
 echo "Found pending blacklist on ${bad_gpu}"
 exit 1
 fi
 echo "No pending blacklist on ${bad_gpu}"
 done
 exit 0

On the GPU system I see:

741:19286 ? Z 0:00 [gpu-health-chec] <defunct>
757:30022 ? Z 0:00 [gpu-health-chec] <defunct>
761:31930 ? Z 0:00 [gpu-health-chec] <defunct>
762:31931 ? Z 0:00 [gpu-health-chec] <defunct>
794:37947 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
795:37948 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
796:37955 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
803:37962 ? S 0:00 /bin/bash ./config/gpu-health-check.sh
816:50066 ? Z 0:00 [gpu-health-chec] <defunct>
817:50067 ? Z 0:00 [gpu-health-chec] <defunct>

All of the above just keep piling up. How can I avoid this? The system becomes unresponsive after some hours.

Answer 1

 #!/bin/bash

Why not plain /bin/sh? I don't see any non-POSIX shell constructs in there.

sort | uniq

We could replace with sort -u (that's a standard option).
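A minimal sketch of that equivalence, using made-up UUID strings in place of real `nvidia-smi` output (the values and variable names are assumptions for illustration only):

```shell
#!/bin/sh
# Made-up identifiers standing in for the first CSV column of the query.
uuids='GPU-bbbb
GPU-aaaa
GPU-bbbb'

# Original pipeline: sort, then drop adjacent duplicates.
with_uniq=$(printf '%s\n' "$uuids" | sort | uniq)

# Suggested pipeline: sort and de-duplicate in one process.
with_sort_u=$(printf '%s\n' "$uuids" | sort -u)

if [ "$with_uniq" = "$with_sort_u" ]; then
    echo "sort | uniq and sort -u agree"
fi
```

Both forms leave two lines here, and `sort -u` saves one process per run.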

 echo "Found pending blacklist on ${bad_gpu}"
 exit 1

That message should go to standard error stream: >&2. The same may be true of the other informational messages.
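As a small sketch of what that redirection does, the hypothetical helper `report_problem` below writes its message to standard error, and the two captures show which stream the text ends up on (the helper name and the sample identifier are assumptions, not part of the original script):

```shell
#!/bin/sh
# Hypothetical helper: prints the diagnostic on standard error.
report_problem() {
    echo "Found pending blacklist on $1" >&2
}

# Capture standard output only; standard error is discarded.
captured_stdout=$(report_problem GPU-aaaa 2>/dev/null)

# Capture standard error only; standard output is discarded.
captured_stderr=$(report_problem GPU-aaaa 2>&1 >/dev/null)

echo "stdout: '${captured_stdout}'"
echo "stderr: '${captured_stderr}'"
```

Keeping diagnostics on standard error means anything that consumes the script's standard output sees only the data it asked for.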

 for bad_gpu in "{$bad_gpus}"; do

Really? "{$bad_gpus}" is a single token; I think you meant $bad_gpus there. Especially as we then expand $bad_gpu unquoted in the next line.
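To illustrate the difference, here is a sketch with two made-up UUIDs (assumed values, one per line as the query would emit them) that counts how many times each form of the loop runs:

```shell
#!/bin/sh
# Made-up identifiers standing in for the query result, one per line.
bad_gpus='GPU-aaaa
GPU-bbbb'

# Original form: the quotes produce a single word, with literal braces.
quoted_count=0
for bad_gpu in "{$bad_gpus}"; do
    quoted_count=$((quoted_count + 1))
    first_quoted_value=$bad_gpu
done

# Suggested form: unquoted, so the shell splits the list on whitespace.
split_count=0
for bad_gpu in $bad_gpus; do
    split_count=$((split_count + 1))
done

echo "quoted: $quoted_count iteration(s), unquoted: $split_count iteration(s)"
```

The quoted loop runs once, with a value that starts with `{` and contains a newline, so the later `nvidia-smi -i` call would never receive a valid single identifier.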

grep Pending| grep Yes > /dev/null 2>&1

If Pending and Yes always occur in the same order, we could simplify to a single command (and we don't need the redirection):

grep -q 'Pending.*Yes'
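A short sketch comparing the two forms follows. The sample lines are illustrative only and are not captured `nvidia-smi` output; the function names are hypothetical. It also shows why the ordering caveat matters:

```shell
#!/bin/sh
# Illustrative lines only; they are not captured nvidia-smi output.
in_order='Pending : Yes'
reversed='Yes : Pending'

# Original form: two greps, with output discarded by redirection.
two_greps() {
    printf '%s\n' "$1" | grep Pending | grep Yes > /dev/null 2>&1
}

# Suggested form: one grep; -q suppresses output.
one_grep() {
    printf '%s\n' "$1" | grep -q 'Pending.*Yes'
}

for line in "$in_order" "$reversed"; do
    if two_greps "$line"; then a=match; else a='no match'; fi
    if one_grep "$line"; then b=match; else b='no match'; fi
    echo "'$line': two greps -> $a, one grep -> $b"
done
```

The two forms agree when `Pending` comes first and differ only on the reversed line, which is the case the "same order" condition excludes.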

 if [ $? -eq 0 ]; then

That's an antipattern - it's a sign that you need to move the preceding statement into the if:

if nvidia-smi -i "$bad_gpu" ...
then

Toby Speight · Answer 1 · 2020-02-26 17:19:39Z
