KatarinaYuan/StructTokenBench

Protein Structure Tokenization: Benchmarking and New Recipe

PyTorch implementation of StructTokenBench, a benchmark for comprehensive evaluation of protein structure tokenization methods, and AminoAseed, an advanced VQ-VAE-based protein structure tokenizer. Code authored by Xinyu Yuan and Zichen Wang.

Overview

StructTokenBench is a benchmark for comprehensively evaluating protein structure tokenization methods. We further developed AminoAseed, which, compared to the leading model ESM3, achieves an average performance improvement of 6.31% across 24 supervised tasks, 12.83% in sensitivity, and 124.03% in codebook utilization.

This repository is based on PyTorch 2.2 and Python 3.11.

Features

  • A comprehensive benchmark for protein structure tokenizers, encompassing 9 different protein structure tokenizers.
  • Easy to extend to new structure tokenizers, and new datasets.
  • Pretraining recipe to reproduce ESM3's structure tokenizer
  • All data preprocessing details to curate residue-level protein supervised tasks

Updates

  • May 1st, 2025: StructTokenBench is accepted to ICML 2025!
  • Apr 24th, 2025: StructTokenBench code released!
  • Feb 28th, 2025: StructTokenBench preprint released on arXiv!

Installation

Install the dependencies in a conda environment with the following bash commands.

conda create -n pstbench python=3.11
conda activate pstbench
conda install pytorch==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install lmdb
pip install --upgrade packaging 
pip install hydra-core
pip install lightning
pip install transformers
pip install deepspeed
pip install -U tensorboard
pip install ipdb
pip install esm
pip install cloudpathlib
pip install pipreqs
pip install lxml
pip install proteinshake
pip install tmtools
pip install tape_proteins
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.2.0+cu121.html
pip install accelerate
pip install torch_geometric
pip install line_profiler
pip install mini3di
pip install dm-tree
pip install colorcet
pip install ogb==1.2.1
pip install sympy
pip install ase
pip install torch-cluster
pip install jax==0.4.25
pip install tensorflow
pip install biopython
pip install seaborn

To enable the Cheap tokenizer, dependency conflicts must be resolved so that both esm3 and esm2 can be installed; see ./src/baselines/README.md for details.

General Configuration

export DIR=<your working directory>

Download

Model Checkpoints

CKPT_DIR=$DIR/struct_token_bench_release_ckpt
mkdir -p $CKPT_DIR
cd $CKPT_DIR
gdown https://drive.google.com/drive/folders/1s6mz6MQ7x1XLjt4veET7QT5fZ43_xO7n -O ./codebook_512x1024-1e+19-linear-fixed-last.ckpt --folder
gdown https://drive.google.com/drive/folders/1hl7gAe_Hn1pYQ3ow790ArISVbJ2lmJ8b -O ./codebook_512x1024-1e+19-PST-last.ckpt --folder

Pre-training Datasets

First download all the PDB files, which are also used by the downstream tasks:

DOWNLOAD_DIR=$DIR/pdb_data/mmcif_files
mkdir -p $DOWNLOAD_DIR
cd $DOWNLOAD_DIR
aws s3 cp s3://openfold/pdb_mmcif.zip $DOWNLOAD_DIR --no-sign-request
unzip pdb_mmcif.zip
wget https://files.pdbj.org/pub/pdb/data/status/obsolete.dat

which should result in the following file structure:

├── pdb_data
│   └── mmcif_files
│       ├── mmcif_files
│       │   └── xxx.cif
│       └── obsolete.dat
Then download the list of subsampled PDB indices used for pretraining:

DOWNLOAD_DIR=$DIR/pdb_data/
cd $DOWNLOAD_DIR
gdown https://drive.google.com/uc?id=1UGPbnxeNwlg1jt514J6Foo07pQJEizHy
unzip pretrain.zip
mv pretrain_zip pretrain

Downstream Datasets

Use the following commands:

cd $DIR
gdown https://drive.google.com/uc?id=1wJ4dSNdMyuF0985ET4UuwViHgV-clF4K
unzip struct_token_bench_release_data_download.zip
mv struct_token_bench_release_data_download struct_token_bench_release_data

which should result in the following file structure:

├── struct_token_bench_release_data
│   └── data
│       ├── CATH
│       │   ├── cath-classification-data
│       │   └── sequence-data
│       ├── functional
│       │   └── local
│       │       ├── biolip2
│       │       ├── interpro
│       │       ├── proteinglue_epitoperegion
│       │       └── proteinshake_bindingsite
│       ├── physicochemical
│       │   └── atlas
│       ├── sensitivity
│       │   └── conformational
│       ├── structural
│       │   └── remote_homology
│       └── utility
│           ├── cameo
│           └── casp14

StructTokenBench - Benchmarking

Setup

For all four benchmark perspectives, prepare the following arguments (taking ESM3 as an example):

tokenizer=WrappedESM3Tokenizer
tokenizername=esm3
d_model=128
lr=0.001
EXTRA_MODEL_ARGS=""
# '...' needs to be filled with the content below, different for each task
EXTRA_TASK_ARGS=...
target_field=... experiment_prefix=...
SHARED_ARGS="tokenizer=$tokenizer model.d_model=$d_model trainer.devices=[0] optimization.optimizer.lr=$lr data.target_field=$target_field experiment_name=${experiment_prefix}_${tokenizername}_lr${lr} run_name=tryout_test default_data_dir=$DIR/struct_token_bench_release_data/ data.pdb_data_dir=$DIR/pdb_data/mmcif_files/ trainer.default_root_dir=$DIR/struct_token_bench_logs/ ${EXTRA_TASK_ARGS} ${EXTRA_MODEL_ARGS}"
# task-specific python command
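
Note that the shell expands variables at the moment SHARED_ARGS is assigned, so target_field, experiment_prefix and EXTRA_TASK_ARGS have to be set before that line, and SHARED_ARGS has to be rebuilt whenever they change. One way to make this harder to get wrong is to wrap the assignment in a function. The following is a minimal sketch: `build_shared_args` is a hypothetical helper that the repository does not provide, and `/tmp/stb_demo` is a placeholder path.

```bash
# Hypothetical helper (not part of the repository): rebuilds the shared
# Hydra overrides from whatever task variables are currently set.
build_shared_args() {
    printf '%s ' \
        "tokenizer=$tokenizer" \
        "model.d_model=$d_model" \
        "trainer.devices=[0]" \
        "optimization.optimizer.lr=$lr" \
        "data.target_field=$target_field" \
        "experiment_name=${experiment_prefix}_${tokenizername}_lr${lr}" \
        "run_name=tryout_test" \
        "default_data_dir=$DIR/struct_token_bench_release_data/" \
        "data.pdb_data_dir=$DIR/pdb_data/mmcif_files/" \
        "trainer.default_root_dir=$DIR/struct_token_bench_logs/"
    printf '%s\n' "$EXTRA_TASK_ARGS $EXTRA_MODEL_ARGS"
}

DIR=${DIR:-/tmp/stb_demo}        # placeholder working directory
tokenizer=WrappedESM3Tokenizer tokenizername=esm3 d_model=128 lr=0.001
EXTRA_MODEL_ARGS=""

# Set the task variables first, then build the argument string.
target_field="binding_label" experiment_prefix="bindint" EXTRA_TASK_ARGS=""
SHARED_ARGS=$(build_shared_args)
echo "$SHARED_ARGS"
```

Calling `build_shared_args` again after switching to another task picks up the new values, which a plain reuse of an old SHARED_ARGS would not.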

For ESM3, remember to log in to your HuggingFace account to get access to the model:

from huggingface_hub import login
login(token="xxx")  # replace "xxx" with your HuggingFace access token

To benchmark all the different tokenizers, use the following arguments:

tokenizer_list=(WrappedESM3Tokenizer WrappedFoldSeekTokenizer WrappedProTokensTokenizer WrappedProteinMPNNTokenizer WrappedMIFTokenizer WrappedCheapS1D64Tokenizer WrappedAIDOTokenizer)
tokenizer_name_list=(esm3 foldseek protokens proteinmpnn mif cheapS1D64 aido)
dmodel_list=(128 2 32 128 256 64 384)
for i in "${!tokenizer_list[@]}"
do
    tokenizer=${tokenizer_list[i]}
    tokenizername=${tokenizer_name_list[i]}
    d_model=${dmodel_list[i]}
    echo $tokenizer, $d_model
    for lr in "0.1" "0.01" "0.001" "0.0001" "0.00005" "0.00001" "0.000005" "0.000001";
    do
        echo $lr, "bindint_${tokenizername}_lr${lr}"
        EXTRA_MODEL_ARGS=""

        # EXTRA_TASK_ARGS=...
        # target_field=... experiment_prefix=...

        SHARED_ARGS="tokenizer=$tokenizer model.d_model=$d_model trainer.devices=[0] optimization.optimizer.lr=$lr data.target_field=$target_field experiment_name=${experiment_prefix}_${tokenizername}_lr${lr} run_name=tryout_test default_data_dir=$DIR/struct_token_bench_release_data/ data.pdb_data_dir=$DIR/pdb_data/mmcif_files/ ${EXTRA_TASK_ARGS} ${EXTRA_MODEL_ARGS}"

        # task-specific python command
    done
done
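
Because the full sweep launches 7 tokenizers times 8 learning rates for each task, it can help to print the commands before running them. The sketch below does a dry run for the BindInt task with a shortened tokenizer list and learning-rate grid. The `DRY_RUN` switch and the `n_cmds` counter are illustrative additions, not repository features, and `/tmp/stb_demo` is a placeholder path.

```bash
DRY_RUN=${DRY_RUN:-1}            # 1: only print commands; 0: execute them
DIR=${DIR:-/tmp/stb_demo}        # placeholder working directory

# BindInt settings, as listed in the Effectiveness section
target_field="binding_label"; experiment_prefix="bindint"
EXTRA_TASK_ARGS=""; EXTRA_MODEL_ARGS=""
n_cmds=0

# Each input line: tokenizer class, short name, d_model (subset of the full list)
while read -r tokenizer tokenizername d_model; do
    for lr in 0.001 0.0001; do   # shortened grid for illustration
        SHARED_ARGS="tokenizer=$tokenizer model.d_model=$d_model trainer.devices=[0] optimization.optimizer.lr=$lr data.target_field=$target_field experiment_name=${experiment_prefix}_${tokenizername}_lr${lr} run_name=tryout_test default_data_dir=$DIR/struct_token_bench_release_data/ data.pdb_data_dir=$DIR/pdb_data/mmcif_files/ ${EXTRA_TASK_ARGS} ${EXTRA_MODEL_ARGS}"
        cmd="python ./src/script/run_supervised_task.py --config-name=interpro.yaml $SHARED_ARGS"
        if [ "$DRY_RUN" = 1 ]; then
            echo "CUDA_VISIBLE_DEVICES=0 $cmd"
        else
            # stdin is redirected so the command cannot consume the list below
            CUDA_VISIBLE_DEVICES=0 $cmd < /dev/null
        fi
        n_cmds=$((n_cmds + 1))
    done
done <<'EOF'
WrappedESM3Tokenizer esm3 128
WrappedFoldSeekTokenizer foldseek 2
EOF
echo "prepared $n_cmds commands"
```

Reading the tokenizer settings from one line per tokenizer keeps the class name, short name and d_model together, which avoids the three parallel arrays drifting out of sync.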

Benchmark our pretrained tokenizer (AminoAseed or VanillaVQ). Remember to download the checkpoints first (see Model checkpoints). Use the following commands:

# choose ONE of the two options below; running both leaves the VanillaVQ values set
# option 1: using AminoAseed
ckpt_name="AminoAseed"
path="$DIR/struct_token_bench_release_ckpt/codebook_512x1024-1e+19-linear-fixed-last.ckpt/checkpoint/mp_rank_00_model_states.pt"
quantizer_use_linear_project=true
# option 2: using VanillaVQ
ckpt_name="VanillaVQ"
path="$DIR/struct_token_bench_release_ckpt/codebook_512x1024-1e+19-PST-last.ckpt/checkpoint/mp_rank_00_model_states.pt"
quantizer_use_linear_project=false
# general extra arguments besides $SHARED_ARGS
tokenizer=WrappedOurPretrainedTokenizer
tokenizername=ourpretrained_${ckpt_name}
d_model=1024
lr=0.001
quantizer_codebook_size=512
EXTRA_MODEL_ARGS="tokenizer_pretrained_ckpt_path=$path tokenizer_ckpt_name=${ckpt_name} quantizer_codebook_size=$quantizer_codebook_size quantizer_codebook_embed_size=$d_model model_encoder_dout=$d_model quantizer_use_linear_project=$quantizer_use_linear_project"
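
Since the two checkpoints differ only in the checkpoint directory and the quantizer_use_linear_project flag, the choice can be driven by a single variable. A minimal sketch, assuming the checkpoint paths listed above; `select_ckpt` is a hypothetical helper name and `/tmp/stb_demo` is a placeholder path.

```bash
# Hypothetical helper (not part of the repository): sets ckpt_name, path and
# quantizer_use_linear_project for one of the two released checkpoints.
select_ckpt() {
    ckpt_name=$1
    ckpt_root="$DIR/struct_token_bench_release_ckpt"
    case "$ckpt_name" in
        AminoAseed)
            ckpt_dir="codebook_512x1024-1e+19-linear-fixed-last.ckpt"
            quantizer_use_linear_project=true ;;
        VanillaVQ)
            ckpt_dir="codebook_512x1024-1e+19-PST-last.ckpt"
            quantizer_use_linear_project=false ;;
        *)
            echo "unknown checkpoint: $ckpt_name" >&2
            return 1 ;;
    esac
    path="$ckpt_root/$ckpt_dir/checkpoint/mp_rank_00_model_states.pt"
}

DIR=${DIR:-/tmp/stb_demo}        # placeholder working directory
select_ckpt AminoAseed
echo "$path (linear projection: $quantizer_use_linear_project)"
```

After the call, EXTRA_MODEL_ARGS can be assembled exactly as shown above.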

Effectiveness

All arguments are summarized in the table below for reference. See the following subsections for details and the Python commands to run.

| Task | Database | target_field | experiment_prefix | config_file | EXTRA_TASK_ARGS |
|---|---|---|---|---|---|
| BindInt | InterPro | "binding_label" | "bindint" | interpro.yaml | / |
| BindBio | BioLIP2 | "binding_label" | "bindbio" | biolip2.yaml | / |
| BindShake | ProteinShake | "binding_site" | "bindshake" | proteinshake_binding_site.yaml | / |
| CatInt | InterPro | "activesite_label" | "catint" | interpro.yaml | / |
| CatBio | BioLIP2 | "catalytic_label" | "catbio" | biolip2.yaml | / |
| Con | InterPro | "conservedsite_label" | "con" | interpro.yaml | / |
| Rep | InterPro | "repeat_label" | "rep" | interpro.yaml | / |
| Ept | ProteinGLUE | "epitope_label" | "ept" | proteinglue_epitope_region.yaml | / |
| FlexRMSF | ATLAS | "rmsf_score" | "flexrmsf" | atlas.yaml | data.pdb_data_dir=$DIR/struct_token_bench_release_data/data/physicochemical/ lightning.callbacks.checkpoint.monitor="validation_spearmanr" |
| FlexBFactor | ATLAS | "bfactor_score" | "flexbfactor" | atlas.yaml | data.pdb_data_dir=$DIR/struct_token_bench_release_data/data/physicochemical/ lightning.callbacks.checkpoint.monitor="validation_spearmanr" |
| FlexNEQ | ATLAS | "neq_score" | "flexneq" | atlas.yaml | data.pdb_data_dir=$DIR/struct_token_bench_release_data/data/physicochemical/ lightning.callbacks.checkpoint.monitor="validation_spearmanr" |
| Homo | TAPE | "fold_label" | "homo" | remote_homology.yaml | optimization.micro_batch_size=64 |
Functional Site Prediction (Per-residue Binary Classification)

Binding Site Prediction

# BindInt (from InterPro database)
target_field="binding_label" experiment_prefix="bindint"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=interpro.yaml $SHARED_ARGS
# BindBio (from BioLIP2 database)
target_field="binding_label" experiment_prefix="bindbio"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=biolip2.yaml $SHARED_ARGS 
# BindShake (from ProteinShake database)
target_field="binding_site" experiment_prefix="bindshake"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=proteinshake_binding_site.yaml $SHARED_ARGS

Catalytic Site Prediction

# CatInt (from InterPro database)
target_field="activesite_label" experiment_prefix="catint"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=interpro.yaml $SHARED_ARGS 
# CatBio (from BioLIP2 database)
target_field="catalytic_label" experiment_prefix="catbio"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=biolip2.yaml $SHARED_ARGS

Conserved Site Prediction

# Con (from InterPro database)
target_field="conservedsite_label" experiment_prefix="con"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=interpro.yaml $SHARED_ARGS

Repeat Motif Prediction

# Rep (from InterPro database)
target_field="repeat_label" experiment_prefix="rep"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=interpro.yaml $SHARED_ARGS

Epitope Region Prediction

# Ept (from ProteinGLUE database)
target_field="epitope_label" experiment_prefix="ept"
EXTRA_TASK_ARGS=""
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=proteinglue_epitope_region.yaml $SHARED_ARGS

Physicochemical Property Prediction (Per-residue Regression)

Structural Flexibility Prediction

# choose ONE of the three target_field / experiment_prefix pairs below
# FlexRMSF (from ATLAS database)
target_field="rmsf_score" experiment_prefix="flexrmsf"
# FlexBFactor (from ATLAS database)
target_field="bfactor_score" experiment_prefix="flexbfactor"
# FlexNEQ (from ATLAS database)
target_field="neq_score" experiment_prefix="flexneq"
# EXTRA_TASK_ARGS and python commands are shared for FlexRMSF, FlexBFactor and FlexNEQ
EXTRA_TASK_ARGS="data.pdb_data_dir=$DIR/struct_token_bench_release_data/data/physicochemical/ lightning.callbacks.checkpoint.monitor='validation_spearmanr'"
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=atlas.yaml $SHARED_ARGS

Structure Property Prediction (Per-protein Multiclass classification)

Remote Homology Detection

# Homo (TAPE)
target_field="fold_label" experiment_prefix="homo"
EXTRA_TASK_ARGS=optimization.micro_batch_size=64
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=remote_homology.yaml $SHARED_ARGS

Sensitivity

Structural Similarity Consistency

target_field="tm_score" experiment_prefix="conformational"
EXTRA_TASK_ARGS="test_only=true experiment_name=${experiment_prefix}_${tokenizername}"
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=conformational_switch.yaml $SHARED_ARGS

Distinctiveness

Code Pairwise Similarity

target_field=null task_goal="codebook_diversity" experiment_prefix="${task_goal}_casp14"
EXTRA_TASK_ARGS="test_only=true model.task_goal=${task_goal} experiment_name=${experiment_prefix}_${tokenizername} optimization.micro_batch_size=1"
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=casp14.yaml $SHARED_ARGS
# after getting all pairwise similarities from the different tokenizers, visualize them with the following code
python run_plot_codebook_diversity.py

Codebook Utilization (Efficiency)

Code Usage Frequency or Compression Ratio

# CASP14
target_field=null task_goal="codebook_utilization" experiment_prefix="${task_goal}_casp14"
EXTRA_TASK_ARGS="test_only=true model.task_goal=${task_goal} experiment_name=${experiment_prefix}_${tokenizername} optimization.micro_batch_size=8"
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=casp14.yaml $SHARED_ARGS
# CAMEO
target_field=null task_goal="codebook_utilization" experiment_prefix="${task_goal}_cameo"
EXTRA_TASK_ARGS="test_only=true model.task_goal=${task_goal} experiment_name=${experiment_prefix}_${tokenizername} optimization.micro_batch_size=8"
CUDA_VISIBLE_DEVICES=0 python ./src/script/run_supervised_task.py --config-name=cameo.yaml $SHARED_ARGS

AminoAseed - Our Structure Tokenizer

Before pretraining, first run the Code Usage Frequency commands from the Codebook Utilization evaluation with the ESM3 tokenizer; this preprocesses the CASP14 and CAMEO test data.

Pre-train VanillaVQ and AminoAseed

# choose ONE of the two configurations below
# VanillaVQ
use_linear_project=false
freeze_codebook=false
model_name="VanillaVQ"
# AminoAseed
use_linear_project=true
freeze_codebook=true
model_name="AminoAseed"
 
# shared command
warmup_step=5426
total_step=108530
lr=0.0001
fast_dev=false # enable to debug with 500 samples
python ./src/script/run_pretraining_vqvae.py --config-name=pretrain.yaml \
 tokenizer=WrappedESM3Tokenizer trainer.devices=[0,1,2,3] \
 optimization.micro_batch_size=4 \
 optimization.scheduler.num_warmup_steps=${warmup_step} \
 max_steps=${total_step} \
 optimization.optimizer.lr=$lr \
 optimization.scheduler.plateau_ratio=0.0 \
 lightning.callbacks.checkpoint.monitor="validation_bb_rmsd" \
 lightning.callbacks.checkpoint.mode="min" \
 lightning.callbacks.checkpoint.save_top_k=1 \
 trainer.log_every_n_steps=512 \
 data.fast_dev_run=${fast_dev} \
 data.data_version=mmcif_files_filtered_subsample10 \
 experiment_name=vqvae-pretrain-subsample10_${model_name}_fastdev${fast_dev} \
 run_name=test \
 model.quantizer.use_linear_project=${use_linear_project} \
 model.quantizer.freeze_codebook=${freeze_codebook} \
 model.ckpt_path='' \
 default_data_dir=$DIR/struct_token_bench_release_data/ \
 data.pdb_data_dir=$DIR/pdb_data/mmcif_files/ \
 trainer.default_root_dir=$DIR/struct_token_bench_logs/
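
For orientation, the step counts above correspond to a warmup of about 5% of training. The arithmetic below checks this. The samples-per-step figure additionally assumes plain data parallelism over the four listed devices with no gradient accumulation, which the command itself does not state.

```bash
warmup_step=5426
total_step=108530
n_devices=4          # trainer.devices=[0,1,2,3]
micro_batch_size=4   # optimization.micro_batch_size

# Warmup share in tenths of a percent, rounded to the nearest integer.
warmup_permille=$(( (warmup_step * 1000 + total_step / 2) / total_step ))

# ASSUMPTION: one micro-batch per device per step, no gradient accumulation.
samples_per_step=$(( n_devices * micro_batch_size ))

echo "warmup: $((warmup_permille / 10)).$((warmup_permille % 10))% of training"
echo "samples per step (assumed): $samples_per_step"
```

If the number of devices or the micro batch size is changed, the same arithmetic gives a starting point for rescaling warmup_step and total_step.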

Citation

If you find this codebase useful in your research, please cite the original paper.

@article{yuan2025protein,
 title={Protein Structure Tokenization: Benchmarking and New Recipe},
 author={Yuan, Xinyu and Wang, Zichen and Collins, Marcus and Rangwala, Huzefa},
 journal={arXiv preprint arXiv:2503.00089},
 year={2025}
}
