
I have a shell script on Linux that extracts NetCDF data for each grid point into a .txt file.

This is the annually averaged NetCDF data I used in this conversion (it was originally daily data); the file size is 85.1 MB: https://drive.google.com/file/d/1Nud5b3KvjtE8H4Ol73zdnNSb9eeItaY7/view?usp=sharing
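
For reference, the grid and variable layout of the file can be checked with CDO before running the extraction; a quick sketch, assuming cdo is on the PATH:

# Print a short summary of the variables and the time axis
cdo sinfon canesm5_r1i1p1f1_w5e5_ssp126_tas_global_daily_2015_2100_05gc.nc
# Print the grid description (a global 0.5-degree grid should show 720 x 360 points)
cdo griddes canesm5_r1i1p1f1_w5e5_ssp126_tas_global_daily_2015_2100_05gc.nc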

The script runs correctly without any errors. All of the scripts below have been checked with https://www.shellcheck.net/ and no errors are reported.

This is my script, named extract_global.sh:

#!/bin/bash
# Define input file, log file, and output directory
input_file="canesm5_r1i1p1f1_w5e5_ssp126_tas_global_daily_2015_2100_05gc.nc"
log_file="log.txt"
output_dir="/path/output/" # Replace with the desired output directory

# Add the directory containing the cdo binary (not the binary itself) to the PATH
export PATH=/home/appl/cdo-1.9.8/bin:$PATH

# Remove the existing log file if it exists
rm -f "$log_file"

# Create the output directory if it doesn't exist
#mkdir -p "$output_dir"

# Function to extract data for a specific latitude
extract_latitude() {
    lat=1ドル
    lon_index=1
    max_jobs=20
    job_count=0
    pids=()

    # Loop through longitudes from -180 to 179.5 in 0.5-degree increments
    for lon in $(seq -180 0.5 179.5); do
        # Define the output file name based on the current latitude and longitude index
        output_file="${output_dir}${lat}_${lon_index}.txt"

        # Extract data for the grid point at the current latitude and longitude,
        # keep only the values, and remove the header line
        (/home/appl/cdo-1.9.8/bin/cdo -outputtab,value -remapnn,lon="${lon}"_lat="${lat}" "$input_file" | sed '1d' > "$output_file") &>> "$log_file" &

        # Store the PID of the background job
        pids+=($!)

        # Increment the longitude index and job count
        lon_index=$((lon_index + 1))
        job_count=$((job_count + 1))

        # Once the maximum number of concurrent jobs is reached, wait for the whole batch
        if [ "$job_count" -ge "$max_jobs" ]; then
            for pid in "${pids[@]}"; do
                wait "$pid"
            done
            pids=()     # Reset the PID array
            job_count=0 # Reset the job count
        fi
    done

    # Wait for any remaining background processes to finish
    for pid in "${pids[@]}"; do
        wait "$pid"
    done
}

# Loop through latitudes from 90N to 90S in 0.5-degree increments
for lat in $(seq 90 -0.5 -90); do
    echo "Extracting data for latitude $lat"
    extract_latitude "$lat"
done

echo "Data extraction completed. Check $log_file for details."

If I run ./extract_global.sh directly, the program works correctly: it extracts 20 grid points at a time, waits until the current batch has finished before continuing with the next loop, and produces the correct extracted data inside each .txt file.

By default, all jobs on our Linux server run on the main CPU. I want to move this job to a different CPU by submitting it to a different queue. The main CPU is shared by several members, so running heavy jobs on it slows down the server. For that reason, in the code above, I capped the number of concurrent jobs at 20 and used a loop; however, this makes the process very long and inefficient.
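
As a side note, the manual batching above could also be expressed as a fixed-size worker pool; below is a minimal, untested sketch of the same per-point extraction driven by xargs -P (it assumes GNU xargs and bash, and the helper name extract_point is made up for illustration):

#!/bin/bash
input_file="canesm5_r1i1p1f1_w5e5_ssp126_tas_global_daily_2015_2100_05gc.nc"
output_dir="/path/output/"
export input_file output_dir

# Extract one grid point per invocation, dropping the header line as in the original script
extract_point() {
    lat=1ドル lon=2ドル idx=3ドル
    /home/appl/cdo-1.9.8/bin/cdo -outputtab,value \
        -remapnn,lon="${lon}"_lat="${lat}" "$input_file" \
        | sed '1d' > "${output_dir}${lat}_${idx}.txt"
}
export -f extract_point

# Emit "lat lon index" triples and keep 20 workers busy at all times
for lat in $(seq 90 -0.5 -90); do
    idx=1
    for lon in $(seq -180 0.5 179.5); do
        printf '%s %s %s\n' "$lat" "$lon" "$idx"
        idx=$((idx + 1))
    done
done | xargs -n 3 -P 20 bash -c 'extract_point "$@"' _

The advantage over fixed batches of 20 is that a new job starts as soon as any worker finishes, instead of waiting for the slowest job in each batch.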

I tried to solve this by creating a go.bat file to set up the queue and other configurations. The contents of go.bat are as follows:

#!/bin/sh
#$ -S /bin/sh
# Specify the queue
#$ -q ib.q@node02
# Define the job name
#$ -N SSP_kzm
# Write standard output and standard error to the same file
#$ -j y
# Specify the error file (ignored when -j y is set)
#$ -e /path/log.txt
#$ -M [email protected]
# Send email at the beginning and end of the job
#$ -m be

# Load necessary modules
# module load cdo
# Or, if cdo is in a custom directory, add that directory (not the binary itself) to the PATH
export PATH=/home/appl/cdo-1.9.8/bin:$PATH

cd /path/
# POSIX sh has no ">&" redirection, so redirect stdout and stderr explicitly
./extract_global.sh > log.txt 2>&1
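
Since the symptoms suggest the batch environment differs from my interactive shell, a few diagnostic lines could be added near the top of go.bat to compare the two (illustration only):

# Diagnostics: confirm what the execution node actually sees
echo "PATH=$PATH"
command -v cdo || echo "cdo not found in PATH"
ulimit -a # core-dump and other resource limits on the node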

Finally, I ran the job with the settings in the run.sh file, as shown below:

#!/bin/bash
# Submit the conversion program and wait for it to complete
qsub go.bat
echo "Waiting for conversion computation"
date
# Wait for the job to finish
while qstat | grep -q kzm; do
 sleep 5
done
date
echo "Job completed"

I ran it with:

nohup ./run.sh >& log.txt &

Then I checked with qstat -f and got this information:

queuename qtype resv/used/tot. np_load arch states
----------------------------------------------------------------------
ib.q@node02 BIP 0/1/20 0.05 lx-amd64 
19847 0.50500 SSP_kzm kzm r 10/10/2024 09:28:04 1 
----------------------------------------------------------------------

Results:

  • The program ignores the wait commands inside the extract_global.sh script and produces the output .txt files very quickly.
  • The program generated all of the .txt files expected for the grid of the NetCDF file, but they were empty. In addition, the run produced numerous files named "cores.number", each around 4 MB, containing unreadable characters when opened directly in a text editor.
  • No errors were recorded in log.txt, but no values were stored in the .txt output files.
  • I'm not sure whether the method I'm using is correct. I've looked for information and tried similar solutions, but I haven't been able to fix it yet.

The main point is that I want to run the ./extract_global.sh script on the queue named ib.q@node02 and allocate all of the available CPUs on that node (20 CPUs) to speed up the conversion.
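
From what I have read, the usual Grid Engine way to reserve multiple slots is a parallel environment directive such as the one below; the PE name smp is only a guess, and the actual name would need to be confirmed with qconf -spl on this cluster:

#$ -q ib.q@node02
#$ -pe smp 20
# Requests 20 slots on the node; "smp" is a site-specific PE name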

Should I use MPI parallel computation to be able to use all of the CPUs on that node?

If someone is able to help me, I would be very grateful.

Thank you.

Comments:

  • I'm voting to close this question because the poster didn't use shellcheck.net as directed by the bash tag.
  • I don't have access to sample input files, and I couldn't spend the time even if I did. I recommend you read the sections "Before asking about problematic code" and "How to turn a bad script into a good question" from the same bash tag mentioned above (this advice applies to all languages, not just bash). You should try to boil your problem down to code that readers can copy/paste into their environment and see the same problem. Not voting to close as you are an early poster, and you did post a very nice clean entry. Good luck!
  • @EdMorton, I apologize for my mistake; I didn't check my script previously because I had only just learned about shellcheck, and my conversion script worked well (there was no error). However, I have now checked all of my scripts using shellcheck.net. Thank you for the information.
  • The immediate solution to your current error message is to add a line near the top of your extract_...sh script like export PATH="/full/path/to:$PATH", where the /full/path/to directory contains the cdo command. Not elegant, but this should get you thinking about the differences between your terminal environment and the environment where the command is failing. I mentioned "How to turn a bad script ..." because your problem (as described now) is on one line of code; a much smaller script could still have demonstrated it. Good luck!
  • I didn't suggest you change the title to "Replace the faulty scripts with the shellcheck-clean version" - I suggested that you edit your question to do that, i.e. remove the scripts currently present in the question and show the ones you fixed after running shellcheck instead, so we're not looking at code that contains errors that shellcheck can detect. I put the title back as it was.
