otk: ecDNA Analysis Toolkit

otk (ecDNA Analysis Toolkit) is a machine learning toolkit for predicting extrachromosomal DNA (ecDNA) cargo genes. It classifies genes at the gene level (ecDNA cargo vs. non-ecDNA) and identifies focal amplification types at the sample level (nofocal, noncircular, circular/ecDNA).

Based on the paper: Wang, S., et al. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications.

Core Features

Machine learning-based ecDNA cargo gene prediction at gene level
Sample-level focal amplification type classification (nofocal/noncircular/circular)
Multiple pre-trained models (XGBoost, Neural Networks, TabPFN)
Efficient command-line interface for training and prediction
GPU acceleration support
Pre-trained models ready to use after pip install
RESTful API for web service deployment
Chinese mirror support for large model downloads

Installation

From PyPI (Recommended)

pip install otk-ecdna

This installs the otk CLI command and all pre-trained models except TabPFN, which is ~275MB and must be downloaded separately.

Download Large Models

The TabPFN model (~275MB) is hosted on GitHub Releases:

# List available large models
otk download --list
# Download TabPFN model
otk download --model tabpfn

From Source

git clone https://github.com/WangLabCSU/otk.git
cd otk
pip install -e .

Quick Start

# Check installation
otk --version
# List available models
otk models
# Run prediction (example)
otk predict --input data.csv --output predictions.csv --model xgb_new
# Start API server
otk api --port 8000

CLI Commands

Model Management

# List all available models with performance metrics
otk models
# Analyze a specific model
otk analyze --model xgb_new
# Generate model configuration
otk config generate --model xgb_new

Training

# Train single model
otk train --model xgb_new --gpu 0
# Train neural network model
otk train --model transformer --gpu 0
# Train all models sequentially
otk train --all --gpu 0
# Train all models in parallel on multiple GPUs
otk train --all --parallel --gpus 0,1,2,3
# CPU-only training
otk train --model xgb_new --gpu -1

Prediction

# Basic prediction
otk predict --input data.csv --output predictions.csv --model xgb_new
# With GPU acceleration
otk predict -i data.csv -o results/ -m transformer --gpu 0
# With custom threshold
otk predict -i data.csv -o predictions.csv -m xgb_new --threshold 0.5
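
For scripted pipelines, the same prediction call can be assembled from Python. This is a minimal sketch that only uses the flags shown above (`--input`, `--output`, `--model`, `--threshold`, `--gpu`); the helper name `build_predict_command` is hypothetical and not part of otk.

```python
import shlex
import subprocess


def build_predict_command(input_csv, output_path, model, threshold=None, gpu=None):
    """Assemble an `otk predict` argument list (hypothetical helper, not part of otk)."""
    cmd = [
        "otk", "predict",
        "--input", str(input_csv),
        "--output", str(output_path),
        "--model", model,
    ]
    # Optional flags are only added when a value is given.
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    return cmd


cmd = build_predict_command("data.csv", "predictions.csv", "xgb_new", threshold=0.5)
print(shlex.join(cmd))

# To actually run the command (requires otk to be installed and on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.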

API Server

# Start API with default settings (base path /otk)
otk api
# Custom port
otk api --port 8080
# Serve at root (no base path)
otk api --base-path ""
# Development mode with auto-reload
otk api --reload
# Multiple workers
otk api --workers 4

Model Download

# List large models requiring download
otk download --list
# Download TabPFN model
otk download --model tabpfn
# Force re-download
otk download --model tabpfn --force

Data Format

Input Data Format

Input data should be in CSV format.

Minimal required columns:

Column	Description
`sample`	Tumor sample ID
`gene_id`	Gene ID (e.g., ENSG00000284662)
`segVal`	Total gene copy number

Auto-filled columns (defaults applied if missing):

Column	Default	Description
`minor_cn`	0	Minor copy number
`intersect_ratio`	1.0	Segment-gene overlap ratio
`purity`	0.8	Tumor purity
`ploidy`	2.0	Genome ploidy
`AScore`	10.0	Aneuploidy score
`pLOH`	0.1	LOH proportion
`cna_burden`	0.2	CNA burden
`CN1-CN19`	0.05 each	Copy number signatures
`type`	-	Cancer type → auto-converts to `type_*` columns

Automatically generated features (from gene_id matching):

Column	Description
`freq_Linear`	Prior frequency in linear amplifications
`freq_BFB`	Prior frequency in BFB events
`freq_Circular`	Prior frequency in ecDNA
`freq_HR`	Prior frequency in HR events

Training data requires:

Column	Description
`y`	Binary label (1=ecDNA cargo gene, 0=not)

Supported cancer types (24): BLCA, BRCA, CESC, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SARC, SKCM, STAD, THCA, UCEC, UVM
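
As an illustration of how a `type` value could expand into `type_*` columns, the sketch below one-hot encodes the 24 supported cancer types. The exact column names (`type_BRCA`, etc.) and the choice to raise an error on unsupported types are assumptions; otk may handle these differently.

```python
# The 24 supported cancer types, as listed above.
CANCER_TYPES = (
    "BLCA BRCA CESC COAD DLBC ESCA GBM HNSC KICH KIRC KIRP LGG "
    "LIHC LUAD LUSC OV PRAD READ SARC SKCM STAD THCA UCEC UVM"
).split()


def encode_cancer_type(cancer_type):
    """One-hot encode a cancer type into type_* columns (hypothetical helper)."""
    if cancer_type not in CANCER_TYPES:
        # Assumption: unsupported types are rejected rather than silently zeroed.
        raise ValueError(f"unsupported cancer type: {cancer_type!r}")
    return {f"type_{t}": int(t == cancer_type) for t in CANCER_TYPES}


encoded = encode_cancer_type("BRCA")
print(encoded["type_BRCA"], encoded["type_GBM"])
```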

Output Format

Column	Description
`sample`	Sample ID
`gene_id`	Gene ID
`prediction_prob`	Probability of ecDNA (0-1)
`prediction`	Binary classification (0/1)
`sample_level_prediction_label`	Sample type: nofocal/noncircular/circular
`sample_level_prediction`	Sample type code (0/1/2)
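
A typical downstream step is collecting the predicted cargo genes for each sample. The sketch below does this with the standard library, using the column names from the table above. The helper name and the demo rows (including the `GENE_*` identifiers) are made up for illustration, and it assumes `prediction` is written as `0`/`1`.

```python
import csv
import io
from collections import defaultdict


def cargo_genes_by_sample(csv_file):
    """Map each sample ID to the gene IDs predicted as ecDNA cargo."""
    cargo = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # float() first so that both "1" and "1.0" are accepted.
        if int(float(row["prediction"])) == 1:
            cargo[row["sample"]].append(row["gene_id"])
    return dict(cargo)


# Made-up rows standing in for a real predictions.csv.
demo = io.StringIO(
    "sample,gene_id,prediction_prob,prediction\n"
    "S1,GENE_A,0.91,1\n"
    "S1,GENE_B,0.12,0\n"
    "S2,GENE_C,0.05,0\n"
)
print(cargo_genes_by_sample(demo))
```

With a real output file, pass an open file handle instead, for example `open("predictions.csv", newline="")`.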

Sample classification rules:

circular (2): Any gene predicted as ecDNA cargo
noncircular (1): No gene predicted as ecDNA cargo, but segVal > ploidy + 2
nofocal (0): Otherwise
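
The three rules can be written as a small function. This is a sketch under stated assumptions, not otk's implementation: it assumes a gene counts as ecDNA cargo when `prediction_prob >= threshold`, that a default threshold of 0.5 is reasonable, and that a single gene with `segVal > ploidy + 2` is enough for the noncircular label.

```python
def classify_sample(genes, threshold=0.5):
    """Return (code, label) for one sample.

    `genes` is a list of dicts with `prediction_prob`, `segVal`, and `ploidy`.
    """
    genes = list(genes)
    # Rule 1: any gene predicted as ecDNA cargo makes the sample circular.
    if any(g["prediction_prob"] >= threshold for g in genes):
        return 2, "circular"
    # Rule 2: no cargo gene, but a copy number well above ploidy.
    if any(g["segVal"] > g["ploidy"] + 2 for g in genes):
        return 1, "noncircular"
    # Rule 3: everything else.
    return 0, "nofocal"


sample = [
    {"prediction_prob": 0.10, "segVal": 5.0, "ploidy": 2.0},
    {"prediction_prob": 0.03, "segVal": 2.0, "ploidy": 2.0},
]
print(classify_sample(sample))
```

The rules are checked in priority order, so a sample with both a cargo gene and a high-copy non-cargo gene is still labelled circular.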

Available Models

Model	Type	Test auPRC	Description
xgb_new	XGBoost	0.8339	Optimized with feature engineering
tabpfn	TabPFN	0.8323	TabPFN ensemble (~275MB, needs download)
deep_residual	Neural	0.8132	Deep residual network
xgb_tuned	XGBoost	0.8065	Hyperparameter tuned
optimized_residual	Neural	0.7906	Optimized residual network
baseline_mlp	Neural	0.7663	Simple MLP baseline
dgit_super	Neural	0.7662	Deep gated interaction transformer
xgb_paper	XGBoost	0.7138	Paper reproduction (11 features)
transformer	Neural	0.6875	Transformer architecture

All models use a unified 80/10/10 data split with seed=2026 for reproducibility.
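
To show what a seeded 80/10/10 split looks like in practice, here is a minimal sketch. It is not the split code used by otk: the random number generator, the unit being split (rows versus samples), and the reading of the three parts as train/validation/test are all assumptions.

```python
import random


def split_80_10_10(items, seed=2026):
    """Shuffle with a fixed seed, then cut into 80% / 10% / 10% parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_first = len(items) * 8 // 10
    n_second = len(items) // 10
    first = items[:n_first]
    second = items[n_first:n_first + n_second]
    third = items[n_first + n_second:]
    return first, second, third


train, val, test = split_80_10_10(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed means every model sees the same partition, which is what makes the auPRC values in the table comparable.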

API Service

Start a RESTful API for web-based prediction:

# Start API (default base path /otk)
otk api
# Access points:
# - API docs: http://localhost:8000/otk/docs
# - Health: http://localhost:8000/otk/health
# - Web UI: http://localhost:8000/otk/
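
A client script needs to account for the base path when building URLs. The sketch below derives the three access points listed above from a host, port, and base path, and includes a simple health probe. The helper names are hypothetical, and the shape of the health response body is not documented here, so only the HTTP status code is returned.

```python
import urllib.request


def endpoint_urls(host="localhost", port=8000, base_path="/otk"):
    """Build the docs, health, and web UI URLs for a given base path."""
    trimmed = base_path.strip("/")
    prefix = f"/{trimmed}" if trimmed else ""  # "" serves the API at the root
    root = f"http://{host}:{port}{prefix}"
    return {
        "docs": f"{root}/docs",
        "health": f"{root}/health",
        "ui": f"{root}/",
    }


def check_health(url, timeout=5.0):
    """Return the HTTP status code of the health endpoint (server must be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


urls = endpoint_urls()
print(urls["health"])

# With `otk api` running locally:
# print(check_health(urls["health"]))
```

For a server started with `otk api --port 8080 --base-path ""`, call `endpoint_urls(port=8080, base_path="")` instead.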

See otk_api/README.md for full API documentation.

Project Structure

otk/
├── src/otk/ # Core library
│ ├── data/ # Data handling
│ ├── models/ # Model implementations
│ ├── predict/ # Prediction utilities
│ └── cli.py # Command-line interface
├── otk_api/ # FastAPI web service
│ ├── api/ # API implementation
│ ├── models/ # Pre-trained models
│ └── static/ # Performance charts
├── configs/ # Model configurations
└── tests/ # Unit tests

Citation

If you use otk in your research, please cite:

Wang, S., et al. (2024). Machine learning-based extrachromosomal DNA 
identification in large-scale cohorts reveals its clinical implications 
in cancer. Nature Communications.

License

MIT License. See LICENSE file for details.

Contact

Homepage: https://github.com/WangLabCSU/otk
PyPI: https://pypi.org/project/otk-ecdna/
Email: wangshx@csu.edu.cn
