Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

BMGLab/PeakATail

Repository files navigation

PeakATail

PeakATail — coiled snake with poly(A) tail

PeakATail is a Python tool for single-cell poly(A) site (PAS) detection and alternative polyadenylation (APA) analysis. It works with any scRNA-seq BAM that carries CB:Z (corrected cell barcode) and UB:Z (corrected UMI) tags — STARsolo, CellRanger, Alevin-fry, or any aligner that emits the standard 10x-style tag schema. PeakATail does NOT correct barcodes; your aligner must apply a barcode whitelist (e.g. STARsolo's --soloCBwhitelist). From the input BAM it calls polyadenylation sites at the read level, builds a per-cell PAS count matrix, clusters cells by their APA profiles, and provides downstream analyses including differential APA testing between clusters, 3'UTR length quantification, and cross-dataset cluster matching. Developed at BMGLab.

Docs License: MIT Python 3.11+


Quick install

System prerequisites

PeakATail calls samtools and bedtools as subprocesses. Both must be on your PATH before running the pipeline.

# Debian/Ubuntu
sudo apt-get install samtools bedtools
# macOS (Homebrew)
brew install samtools bedtools

The code has been tested against samtools >= 1.10 (enforced at runtime in ema/datasets/manager.py).

Python version

Python 3.11 or later is required (declared in pyproject.toml as requires-python = ">=3.11").

Install from source

git clone https://github.com/BMGLab/PeakATail.git
cd PeakATail
# Create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate
# Install with uv (recommended — matches the development environment)
pip install uv
uv pip install -e .
# Or with plain pip
pip install -e .

All Python dependencies are declared in pyproject.toml and installed automatically.


Quick start

# Step 1: copy the example config and edit paths
cp example.yaml my_run.yaml
# Edit my_run.yaml — set datasets[].bams, gtf, seqlen, cb_len, barcode_tag
# Step 2: run the full pipeline (peak calling → annotation → clustering)
uv run ema run --config my_run.yaml --threads 4
# Step 3: differential APA between every cluster pair
uv run ema switch diff \
 -i peakatail_runs/<run>/per_dataset/<ds>/clusters.h5ad \
 --pasbed peakatail_runs/<run>/per_dataset/<ds>/pasbed.bed \
 --strategy fisher
# Step 4: 3'UTR length quantification across clusters
uv run ema switch length \
 -i peakatail_runs/<run>/per_dataset/<ds>/clusters.h5ad \
 --strategy classic

The pipeline writes all output under peakatail_runs/<name>_<timestamp>/. A typical run directory looks like:

peakatail_runs/emaout_20240501_143022/
 run_config.json # full resolved parameters
 peakatail_<ts>.log # structured run log
 per_dataset/
 <dataset_id>/
 raw/ # pre-filter BEDs and matrices
 posbed.bed # filtered positive-strand PAS BED
 negbed.bed # filtered negative-strand PAS BED
 pasbed.bed # combined filtered PAS BED (both strands)
 filtered_cb.tsv # barcodes that passed min_read filter
 pas_gene.tsv # PAS-to-gene mapping table
 annotatedpas.bed # pasbed.bed extended with gene_id column
 annotated_matrix.mtx # post-annotation count matrix (MatrixMarket)
 annotated_pas_ids.tsv # row index for annotated_matrix.mtx
 annotated_cells.tsv # column index for annotated_matrix.mtx
 preprocessed.h5ad # filtered AnnData before clustering
 clusters.h5ad # AnnData with leiden cluster labels
 switch_diff_<ts>/ # output of `ema switch diff` (auto-routed)
 switch_length_<ts>/ # output of `ema switch length` (auto-routed)
 switch_match_<ts>/ # output of `ema switch match` (auto-routed)
 switch_geneview_<ts>/ # output of `ema switch geneview` (auto-routed)

What PeakATail produces

Peak calling stage

File Location Description
raw/pos.bed per_dataset/<ds>/raw/ Unfiltered positive-strand PAS calls (concatenated across BAM replicates)
raw/neg.bed per_dataset/<ds>/raw/ Unfiltered negative-strand PAS calls
raw/pas.bed per_dataset/<ds>/raw/ Union of raw pos + neg BED (pre-filter)
raw/pos.mtx per_dataset/<ds>/raw/ Raw count matrix, positive strand (MatrixMarket)
raw/neg.mtx per_dataset/<ds>/raw/ Raw count matrix, negative strand (MatrixMarket)
raw/cb.tsv per_dataset/<ds>/raw/ Raw cell-barcode index aligned to raw MTX columns
posbed.bed per_dataset/<ds>/ Filtered positive-strand PAS BED
negbed.bed per_dataset/<ds>/ Filtered negative-strand PAS BED
pasbed.bed per_dataset/<ds>/ Combined filtered PAS BED (both strands); used as input for downstream commands
filtered_cb.tsv per_dataset/<ds>/ Barcodes that passed the min_read filter, with filter threshold in header

Annotation stage

File Location Description
pas_gene.tsv per_dataset/<ds>/ Two-column table: pas_id, gene_id
annotatedpas.bed per_dataset/<ds>/ pasbed.bed extended with a trailing gene_id column
annotated_matrix.mtx per_dataset/<ds>/ MatrixMarket sparse count matrix (rows = annotated PAS, cols = cells)
annotated_pas_ids.tsv per_dataset/<ds>/ Row index for annotated_matrix.mtx
annotated_cells.tsv per_dataset/<ds>/ Column index for annotated_matrix.mtx

Clustering stage

File Location Description
preprocessed.h5ad per_dataset/<ds>/ AnnData after cell/PAS filtering, before cluster labels are assigned
clusters.h5ad per_dataset/<ds>/ AnnData with leiden cluster labels in .obs; primary input for all switch subcommands

Differential APA stage (ema switch diff)

File Location Description
diff_<c1>_vs_<c2>.tsv switch_diff_<ts>/ Per-cluster-pair differential APA results table
cluster_match.tsv switch_match_<ts>/ Cross-dataset cluster correspondence scores (ema switch match)
pdui_classic.tsv switch_length_<ts>/ Per-cell PDUI scores when strategy is classic
proportion.tsv switch_length_<ts>/ Per-cell per-PAS proportion scores when strategy is proportion
entropy_shannon.tsv switch_length_<ts>/ Per-cell Shannon entropy scores when strategy is shannon

Available CLI commands

All commands are accessed through the ema entry point installed by pip install -e ..

Command Description Docs
ema run Run the full pipeline: peak calling, annotation, clustering cli/run
ema switch diff Differential APA test across cluster pairs (Fisher / NB regression) cli/switch/diff
ema switch length 3'UTR shortening/lengthening quantification (PDUI variants) cli/switch/length
ema switch match Cross-dataset cluster matching cli/switch/match
ema switch geneview Gene-track visualisation: per-cluster PAS coverage and proportions cli/switch/geneview
ema merge Merge multiple BAM files into one sorted and indexed BAM cli/merge
ema parse-gtf Pre-warm the GTF cache so subsequent runs start immediately cli/parse-gtf
ema wizard Interactive setup wizard (also invoked by bare ema) cli/wizard

Use --list-strategies on ema run, ema switch diff, ema switch length, and ema switch match to see the strategies registered in the current installation.

Every command accepts --help for full flag documentation.


Available strategies

Peak calling (ema run --peak-strategy)

Name Description
original Wraps the original pasfind() logic unchanged; used as a baseline for comparison
lambda_poisson MACS2-style: Poisson p-value against a local lambda estimated from floor positions
sierra_iterative Sierra-style iterative peak subtraction; finds multiple PAS per UTR
lambda_gradient Local-lambda estimation combined with gradient-based peak delineation

Clustering (ema run --cluster-method)

Name Description
leiden_tfidf TF-IDF normalisation followed by LSI dimensionality reduction and Leiden community detection
leiden_libsize Library-size normalisation followed by PCA and Leiden community detection
external Load pre-computed cluster labels from a file instead of running clustering

Cross-dataset cluster matching (ema switch match --strategy / ema run --match-method)

Name Description
marker_overlap Match clusters by overlap of top marker PAS sets
jaccard Match clusters by Jaccard similarity of cell sets
mnn Mutual nearest neighbours in a shared LSI embedding

3'UTR length quantification (ema switch length --strategy)

Name Output file Description
classic pdui_classic.tsv Classic 2-PAS PDUI: distal_count / (proximal + distal); proximal/distal selected by genomic or transcript-coordinate rank
proportion proportion.tsv Per-PAS proportion vector: reads per PAS as fraction of gene total per cell
shannon entropy_shannon.tsv Shannon entropy of the per-PAS proportion distribution per (gene, cell)

Differential APA testing (ema switch diff --strategy)

Name Description
fisher Fisher's exact test on per-PAS read counts between two cluster groups
nb_pairwise Negative binomial regression, pairwise cluster comparison
nb_multi Negative binomial regression, multi-condition

Documentation

Full docs: https://bmglab.github.io/PeakATail/

The MkDocs documentation site is built automatically from the develop and main branches via the .github/workflows/docs.yml GitHub Actions workflow and published to GitHub Pages.


Repository layout

PeakATail/
 ema/ # core Python package (entry point: ema.cli:main)
 cli/ # Click subcommands and config schema
 strategies/ # peak-calling strategy registry
 clustering/ # clustering strategies and cross-dataset matching
 quantification/ # PDUI / proportion / entropy strategies
 switch_test/ # differential APA testing strategies
 outputs.py # all file I/O for the pipeline
 main.py # pipeline orchestration
 downstream_runner.py# per-dataset worker (safe for multiprocessing)
 data/ # example data files
 test/ # test suite (pytest)
 other_repos/ # reference implementations (Sierra, SCAPE, scTail, etc.)
 example.yaml # canonical YAML config template
 combined_polya_scrna_methods.csv # comparison of poly(A) scRNA methods
 pyproject.toml # package metadata, dependencies, entry points
 ROADMAP.md # development and publication roadmap
 LICENSE # MIT License
 Dockerfile # container build

Citation

@software{peakatail,
 author = {Amiri Tabat, Amir},
 title = {{PeakATail}: single-cell poly(A) site detection and APA analysis},
 url = {https://github.com/BMGLab/PeakATail},
 version = {0.2.0},
 note = {TODO: replace with journal citation when available}
}

License

This project is licensed under the MIT License — see the LICENSE file for the full text. pyproject.toml declares the matching SPDX identifier (license = "MIT").

About

Single-cell APA peak calling and poly(A) tail analysis from scRNA-seq. Detects polyadenylation sites, quantifies 3'UTR length switching, and compares APA usage across cell clusters and samples.

Topics

Resources

License

Stars

Watchers

Forks

Packages

Contributors

Languages

AltStyle によって変換されたページ (->オリジナル) /