Final Report: GSoC ’25
Student Name: Jiahui Hu (Lareina)
Organization: National Resource for Network Biology (NRNB)
Mentors: Nantia Leonidou, Prof. Dr. Andreas Dräger
Project: Enhancing SBOannotator with LLM Integration & Dynamic Term
Overview
This project transforms SBOannotator from a static, hard-coded tool into a dynamic, intelligent system for annotating Systems Biology Ontology (SBO) terms in SBML models. The enhanced system integrates:
- Real-time SBO term retrieval
- Multiple enzymology data sources (BiGG, KEGG, Reactome, SEED)
- Fine-tuned LLM-assisted annotation
- A Python-runtime GUI and a packaged desktop GUI with interactive visualization
These improvements significantly boost accuracy and usability while preserving the core rule-based strengths.
Methods
1) Automated SBO File Management
- Auto-update detection: compare the stored commit SHA with the latest upstream SHA at startup.
- Versioning: maintain timestamped local versions and automatically delete older SBO files once more than two versions are stored.
- Formats: support .obo and .json with a 4-step validation pipeline.
- User interaction: apply updates, download SBO files, or upload custom SBO files.
- Integrity: round-trip conversion tests to ensure lossless persistence.
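The update check and version pruning described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `sbo_<timestamp>.obo` naming scheme, and the "keep the two newest files" reading of the retention rule are all assumptions.

```python
from datetime import datetime


def needs_update(local_sha, remote_sha):
    """Return True when the upstream commit differs from the stored one."""
    # A missing local SHA (first run) also counts as "update needed".
    return local_sha is None or local_sha != remote_sha


def versioned_name(stem, when, ext=".obo"):
    """Build a timestamped file name (hypothetical naming scheme)."""
    return f"{stem}_{when.strftime('%Y%m%dT%H%M%S')}{ext}"


def files_to_prune(names, keep=2):
    """Return the file names that fall outside the `keep` newest versions."""
    # This timestamp format sorts lexicographically in chronological order,
    # so a reverse sort puts the newest files first.
    ordered = sorted(names, reverse=True)
    return ordered[keep:]


names = [versioned_name("sbo", datetime(2025, 6, day)) for day in (1, 2, 3)]
print(files_to_prune(names))  # ['sbo_20250601T000000.obo']
```

Keeping the pruning logic as a pure function over file names makes it easy to test without touching the file system.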
2) Three-Layer Rule-Based Annotation Workflow
Layer 1 — Configuration / Strategy
Lets users configure which databases are queried, how many, and in what order.
Layer 2 — Adapter Execution
Unified multi-database adapters for identifier extraction and EC-number lookup:
- BiGG (direct API), KEGG (regex + REST), SEED (Solr), Reactome (web parsing + QuickGO)
- Early termination: stop querying once a precise SBO term is found
- EC-number truncation: normalize at first non-digit for consistency
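A rough sketch of the adapter loop with early termination and EC-number truncation is shown below. Everything here is illustrative: the adapters are stand-in callables rather than the real BiGG/KEGG/SEED/Reactome clients, the EC-to-SBO mapping is a made-up one-entry table, and "normalize at first non-digit" is interpreted as cutting the string at the first character that is neither a digit nor a dot.

```python
import re

GENERIC_SBO = "SBO:0000176"  # the generic term named in this report


def truncate_ec(ec):
    """Cut an EC string at the first character that is not a digit or dot."""
    # Assumed interpretation: "1.1.1.-" -> "1.1.1", "3.4.21.B5" -> "3.4.21".
    match = re.match(r"[0-9.]*", ec.strip())
    return match.group(0).strip(".")


def annotate(reaction_id, adapters, ec_to_sbo):
    """Query adapters in the configured order; stop at the first specific term."""
    queried = []
    for name, lookup in adapters:
        queried.append(name)
        ec = lookup(reaction_id)
        if ec is None:
            continue  # this database has no EC number for the reaction
        sbo = ec_to_sbo.get(truncate_ec(ec))
        if sbo and sbo != GENERIC_SBO:
            return sbo, queried  # early termination: skip remaining databases
    return GENERIC_SBO, queried


# Stand-in adapters (hypothetical); real ones would call the remote services.
adapters = [
    ("bigg", lambda rid: None),
    ("kegg", lambda rid: "2.7.1.1a"),
    ("seed", lambda rid: "1.1.1.1"),
]
ec_to_sbo = {"2.7.1.1": "SBO:0000216"}  # illustrative mapping only

sbo, queried = annotate("R_HEX1", adapters, ec_to_sbo)
print(sbo, queried)  # SBO:0000216 ['bigg', 'kegg']
```

In this toy run the third adapter is never called, which is where the time savings of early termination come from.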
Layer 3 — LLM Filter
Target only reactions needing disambiguation:
- Distinct handling for single vs multiple ECs
- Filter conflicting ECs to avoid ambiguity
- Log EC and fetch context for LLM input
3) Fine-tuned LLM for EC → SBO
- Base model: BioBERT (dmis-lab/biobert-base-cased-v1.1)
- Architecture: 768D encoder → FC (768→384→42) with Focal Loss (~111M params)
- Two-stage training:
  - Stage-1: 331 expert samples, 80 epochs
  - Stage-2: 6,966 GPT-generated samples, lower LR, 10 epochs (noise adaptation)
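To make the loss and the classification head above more concrete, here is a small dependency-free sketch. It computes the standard focal loss for a single sample, FL = -(1 - p_t)^γ · log(p_t), and counts the parameters of the two fully connected layers (768→384→42). The value γ = 2 is a common default and an assumption here, since the report does not state it. Class weighting (α) is omitted, and the parameter count assumes plain linear layers with bias terms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    # With gamma = 0 the modulating factor is 1 and this is cross-entropy.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Parameters of the classification head only (encoder not included).
head_params = linear_params(768, 384) + linear_params(384, 42)
print(head_params)  # 311466

# A confident correct prediction is down-weighted relative to cross-entropy.
easy = [4.0, 0.0, 0.0]
print(focal_loss(easy, 0) < focal_loss(easy, 0, gamma=0.0))  # True
```

The head is a small fraction of the total; almost all of the ~111M parameters sit in the encoder.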
4) GUI Application
- PyQt5-based Python-runtime GUI with side-by-side pre-/post-annotation tables and file upload/download
- Packaged via PyInstaller as a macOS DMG (ships the rule-based pipeline)
Results
- SBO updates: switched to direct GitHub download (~1 min per update)
- Coverage: across 108 models, 3,317 reactions upgraded from generic SBO:0000176 to specific terms via multi-database integration
- Efficiency: the initial naive multi-DB flow took ~14 h/model; early termination reduced end-to-end time to a mean of 432.99 s/model (~7.2 min)
- LLM classification: Top-1 accuracy 94.00% (42 classes); Macro-F1 0.4563, Weighted-F1 0.9184; mean confidence 78.83% (median 81.41%, max 88.21%); automatic fallback to Stage-1 when Stage-2 degrades
- Deliverables: Python-runtime GUI and macOS DMG app
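The gap between Macro-F1 and Weighted-F1 in the LLM classification results is easier to read with the definitions in hand: Macro-F1 averages per-class F1 scores equally, while Weighted-F1 weights each class by its number of true samples. The sketch below uses a made-up two-class toy set (not project data) to show how a single missed rare class pulls Macro-F1 down while accuracy and Weighted-F1 stay high.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []  # (f1, support) for each label
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        per_class.append((f1, tp + fn))
    macro = sum(f1 for f1, _ in per_class) / len(per_class)
    total = sum(support for _, support in per_class)
    weighted = sum(f1 * support for f1, support in per_class) / total
    return macro, weighted


# Toy data: nine samples of a frequent class, one of a rare class that is missed.
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10

macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))  # 0.4737 0.8526
```

A large gap between the two scores usually indicates that low-support classes are classified poorly, which fits the plan to add more expert labels.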
Constraints & Future Improvement
- Packaging: DMG includes rule-based workflow; LLM features provided via CLI to avoid heavy runtime dependencies
- Data quality control: intelligent filtering of GPT-generated training data
- More expert labels: expand high-quality human annotations for better generalization
- EC-less reactions: fine-tune the LLM to predict SBO terms for reactions that have no EC number
Thank You
Thanks to the SBOannotator community and Google Summer of Code for this opportunity. Special thanks to mentors Nantia Leonidou and Andreas Dräger for guidance and support. I will continue to monitor issues and PRs and look forward to future collaborations.
Quick Links