My first non-trivial bash
script after fully moving to Linux Mint last week is attached (modify_cccp_bundles.in
). The script performs on protein databank (PDB) files that have been generated using an online CCCP (Coiled-coil Crick Parameterization) server. To use the generated PDB files for my own ends, I am using Rosetta (protein folding software) and phenix (protein modification software) that I have either set in my .bashrc
file or have set a source file in my home to. The operations are:
- add side chains and hydrogens to the CCCP templates using Rosetta
- clear the segment IDs of the PDBs
- change the residue sequences for all 4 chains so that they are continuous based on chain A (
increment_resseq
, there are 28 residues per chain) - add a segment ID of A to the files (first
awk
, segment ID is in column 75) - remove terminating lines from the file (there are 3 in-between the 4 chains that need removing)
- change the chain IDs of all the chains to A (chain ID is in column 22)
Once finished, I have files that are continuously numbered, the same segment ID and chain ID with side chains and hydrogen added that look like this:
Loops are added by another process afterwards. This is my code:
#!/bin/bash
# --- bash script to modify CCCP server generated PDB bundle templates
#
# CCCP: https://www.grigoryanlab.org/cccp/
# Phenix: https://phenix-online.org/documentation/index.html
# Rosetta: https://www.rosettacommons.org/
# ------------------------------------------------------------------- FUNCTIONS
# --- extract first character from string and return capitalised
res_one_letter_cap () {
abbrev=$(printf %.1s "1ドル")
Abbrev="${abbrev^}"
echo $Abbrev
}
# --- write Rosetta residue file for all Ala or Gly
write_resfiles () {
res_abbrev=$(res_one_letter_cap 1ドル)
file_name="all_${1}_resfile.res"
if [ -e $file_name ] # delete existing resfile so it is not being appended
then
rm $file_name
fi
echo -e $"PIKAA ${res_abbrev}\nstart" >> $file_name
}
# --- run rosetta fix backbone to add side chains and hydrogen
run_rosetta_fixbb () {
for pdb in ../../../CCCP/poly-${1}/*; do
fixbb.default.linuxgccrelease -s $pdb -resfile ../all_${1}_resfile.res \
-nstruct 1 -ex1 -ex2 -database $ROSETTA3_DB
done
}
# ------------------------------------------------------------------- CONSTANTS
residues='ala gly' # residue names
segID=A # segment ID
chainID=A # chain ID
# ------------------------------------------------------------------- SOURCE
#source ~/set_Phenix.sh # if path needs setting to phenix (set in ~/.bashrc)
source ~/set_Rosetta.sh # path to script containing rosetta_###/main/source/bin
# and the rosetta database path $ROSETTA3_DB
# ------------------------------------------------------------------- MAIN
# --- Rosetta fixed backbone: add side chains and hydrogen to PDB structures
mkdir ./rosetta && cd "$_"
for res in $residues; do
write_resfiles $res
mkdir ./poly_${res} && cd "$_"
run_rosetta_fixbb $res
rm scores.sc # scores.sc file not needed
cd ../
for pdb in ./poly_${res}/*; do # remove '_0001' Rosetta adds to filenames
ext="${pdb##*.}"
filename="${pdb%.*}"
mv $pdb "${filename%?????}".$ext
done
done
# --- phenix tools: change chain, segment ID's and renumber residues from
# --- constant offset defined in CONSTANTS
tmpfile="$(mktemp)"
phenixtmp=phenix.pdb
mkdir ../phenix/ && cd "$_"
for res in $residues; do
mkdir -p ../templates/poly_${res}
for file in ../rosetta/poly_${res}/*; do
filename="${file##*/}"
cp $file $filename
phenix.pdbtools $filename clear_seg_id=true output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain B" \
increment_resseq=28 output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain C" \
increment_resseq=56 output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain D" \
increment_resseq=84 output.filename=$phenixtmp
# to add 'A' as segment
awk '{print substr(0,1,75ドル) v substr(0,76ドル)}' v=$segID $phenixtmp \
> "$tmpfile" && mv "$tmpfile" $phenixtmp
# remove TER lines
awk '!/TERA/' $phenixtmp > "$tmpfile" && mv "$tmpfile" $phenixtmp
# change chain to A
awk '{print substr(0,1,21ドル) v substr(0,23ドル)}' v=$chainID $phenixtmp\
> "$tmpfile" && mv "$tmpfile" $phenixtmp
# copy to template dir
mv $phenixtmp ../templates/poly_${res}/$filename
rm $filename
done
done
The folder structure looks like this afterwards (using just two PDBs, I have 64 but could in the future be more):
CCCP/
poly-ala/
ala_0001.pdb
ala_0002.pdb
poly-gly/
gly_0001.pdb
gly_0002.pdb
cccp_template_mods/
modify_cccp_bundles.in
rosetta/
poly_ala/
ala_0001.pdb
ala_0001.pdb
poly_gly/
gly_0001.pdb
gly_0001.pdb
all_ala_resfile.res
all_gly_resfile.res
phenix/
templates/
poly_ala/
ala_0001.pdb
ala_0002.pdb
poly_gly/
gly_0001.pdb
gly_0002.pdb
The ~/set_Rosetta.sh
file contains:
export PATH=/usr/local/rosetta.source.release-340/main/source/bin:$PATH
export PYTHONPATH=/usr/local/rosetta.source.release-340/main/pyrosetta_scripts/apps/:$PYTHONPATH
export ROSETTA3=/usr/local/rosetta.source.release-340/main/
export ROSETTA3_DB=/usr/local/rosetta.source.release-340/main/database/
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/rosetta.source.release-340/main/source/build/external/release/linux/5.15/64/x86/gcc/11/default:$LD_LIBRARY_PATH
The generated all_{ala/gly}_resfile.res
contain (#
is a
for ala, g
for gly):
PIKAA #
start
There is no extra ending blank line as this breaks Rosetta functions. Each numbered PDB is the same but with different residue names i.e. ala_0001.pdb
is the alanine equivalent of the glycine backbone gly_0001.pdb
.
The code could be improved by removing the non-templates
directories but I am hesitant to include rm -rf
in a bash
script, I think it could all be done in one loop too. The phenix software will not accept a relative file in the input hence the need for phenixtmp
. I considered using a case
function for the phenix modifications but couldn't work out the best way to do so other than a loop through chains in a list and times the magic number 28
by an index whilst calling the phenix.pdbtools
.
I am more used to TeX.SE etiquette so if I have missed something out of this question please let me know.
1 Answer 1
lint
You should run shellcheck
on this, and follow its advice.
Maybe your directory names lack SPACE characters ATM,
but that might change in future, so it's worthwhile
to habitually quote $file_name
expressions.
In write_resfiles
, it's enough to
unconditionally rm -f "$file_name"
.
No harm done if it doesn't exist, so no need to check beforehand.
portability
I found this a little hard to read:
echo -e $"PIKAA ${res_abbrev}\nstart"
I think you were (roughly) writing echo $"\n"
,
same as echo -e "\n"
.
To use both left me scratching my head.
Typical solution would be to prefer printf
over echo
, here.
You should document the versions of tools you tested against,
including $BASH_VERSINFO
and the MacOS or linux distro OS version.
conservative globbing
for pdb in ../../../CCCP/poly-${1}/*; do
The trailing /*
is perhaps a little more permissive than necessary.
If those files share *.pdb
or another common suffix,
recommend you include that in the glob.
Not putting do
on its own line is idosyncratic, but ok, whatever.
Bash displays the parsed code a bit differently
with: $ type run_rosetta_fixbb
.
idempotency
mkdir ./rosetta
Sometimes things don't go smoothly and we need to edit / re-run several times.
Consider using mkdir -p
here, so it's not an error if
the directory already exists.
use conservative settings
Consider putting "strict mode" near the top of this file:
set -u -e -o pipefail
That causes
- unset / typo variable names to give error rather than empty string,
- immediate bailout on error (any command's non-zero return status), and
- same bailout if command within a pipeline fails
That would make it easier to reason about
"what is $PWD ?" after several {mkdir
, cd
} combos.
directory changing
Consider writing a helper function: mkdir -p "$DIR"; cd "$DIR"
cd ../
Certainly one can write cd foo; cmd1; cd ..; cmd2
You may find it more convenient to fork a subshell:
(cd foo && cmd1)
cmd2
$CWD changes in the subshell, but not in the parent,
so cmd2
does not run down in the foo folder.
You have several stanzas of "do stuff in a subdir" and then next stanza will move on to another subdir to do more stuff. Consider breaking out each stanza into a named function, so your top-level logic is short and simple.
use conservative regex
awk '!/TERA/' $phenixtmp
Maybe that's perfect the way it is.
But if you can anchor with ^
caret, /^TERA/
,
or anchor with $
dollar, /TERA$/
, then do so.
It cuts down on surprises, and it makes your regex go faster.
Consider using grep -v
for this simple task.
Consider using sed -i
to filter in place, without needing a mv
.
user feedback
Not sure how long this takes to run, nor whether 100% of all runs go successfully to the end.
You might want to set user expectations by using du
to estimate the amount of input data to be processed.
Logging "working on X" every minute or every few seconds
might be helpful.
Even if that's too chatty,
printing out "Finished" at the end wouldn't hurt.
Some scripts do one thing well, and compose nicely in pipelines.
This script seems a bit bigger than that.
Consider doing this for a command cmd
that takes a while to run:
(set -x && cmd some_input.pdb)
The (
)
subshell limits the effect of -x
to just displaying the single command along with its arguments.
This code appears to achieve its design goals.
It relies on working versions of several tools to be installed. Adding the install procedures, with an example console session, would reduce the cost of onboarding a newly hired engineer.
It would be helpful to maintainers to freeze a set of example inputs and output, together with generated console output, perhaps in a .ZIP archive. Given that, I would be willing to delegate or accept maintenance tasks on this script.