open-sciencelab/GraphGen

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

English | δΈ­ζ–‡

πŸ“ What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. For details, please see the paper and the best-practice guide.

Below are post-training results in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| --- | --- | --- | --- |
| Plant | SeedBench | 65.9 | 51.5 |
| Common Knowledge | CMMLU | 73.6 | 75.8 |
| Common Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |

GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. It also incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
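The ECE metric mentioned above can be illustrated with a short, self-contained sketch (our own simplification, not the project's implementation): partition the trainee model's confidence scores into bins, then weight each bin's |confidence βˆ’ accuracy| gap by the fraction of samples it holds.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: per-sample confidence scores in (0, 1].
    correct: per-sample 0/1 correctness labels.
    """
    ece = 0.0
    n = len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Samples whose confidence falls in the half-open bin (lo, hi]
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Weight the bin's calibration gap by its share of all samples
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model scores near zero; a high ECE signals knowledge gaps worth targeting with synthetic QA pairs.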

After data generation, you can use LLaMA-Factory or xtuner to fine-tune your LLMs.

πŸ“Œ Latest Updates

  • 2025εΉ΄10月23ζ—₯: We support VQA(Visual Question Answering) data generation now. Run script: bash scripts/generate/generate_vqa.sh.
  • 2025εΉ΄10月21ζ—₯: We support PDF as input format for data generation now via MinerU.
  • 2025εΉ΄09月29ζ—₯: We auto-update gradio demo on Hugging Face and ModelScope.
History
  • 2025εΉ΄08月14ζ—₯: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025εΉ΄07月31ζ—₯: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025εΉ΄04月21ζ—₯: We have released the initial version of GraphGen.

πŸš€ Quick Start

Try GraphGen through the Web demo or the backup Web entrance.

For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

  1. Install uv

    # If you hit network issues, try pipx or pip to install uv instead; see the uv docs for details
    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the repository

    git clone --depth=1 https://github.com/open-sciencelab/GraphGen
    cd GraphGen
  3. Create a new uv environment

     uv venv --python 3.10
  4. Configure the dependencies

    uv pip install -r requirements.txt

Run Gradio Demo

python -m webui.app

For hot-reload during development, run

PYTHONPATH=. gradio webui/app.py

(Screenshot: GraphGen web UI)

Run from PyPI

  1. Install GraphGen

    uv pip install graphg
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
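The six variables above can be read back in Python roughly like this (a minimal sketch; the helper name is ours, not part of the graphg CLI):

```python
import os

def model_config(role: str) -> dict:
    """Collect the MODEL / BASE_URL / API_KEY variables for one role,
    either "SYNTHESIZER" or "TRAINEE", from the environment set above."""
    return {
        "model": os.environ[f"{role}_MODEL"],
        "base_url": os.environ[f"{role}_BASE_URL"],
        "api_key": os.environ[f"{role}_API_KEY"],
    }
```

Keeping the two roles symmetric like this means a single helper covers both the synthesizer and the trainee model.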

Run from Source

  1. Configure the environment

    • Create an .env file in the root directory
      cp .env.example .env
    • Set the following environment variables:
      # Synthesizer is the model used to construct KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model used to train with the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  2. (Optional) Customize generation parameters in the graphgen/configs/ folder.

    Edit the corresponding YAML file, e.g.:

     # configs/cot_config.yaml
     input_file: resources/input_examples/jsonl_demo.jsonl
     output_data_type: cot
     tokenizer: cl100k_base
     # additional settings...
  3. Generate data

    Pick the desired format and run the matching script:

    | Format | Script to run | Notes |
    | --- | --- | --- |
    | cot | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
    | atomic | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
    | aggregated | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
    | multi-hop | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |
  4. Get the generated data

    ls cache/data/graphgen
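Once generation finishes, the output can be inspected with a few lines of Python (a sketch that assumes JSONL output files; the exact file names under cache/data/graphgen depend on the chosen format):

```python
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line, the usual JSONL layout."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

After loading, printing the keys of the first record shows which QA fields the chosen format produced.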

Run with Docker

  1. Build the Docker image
    docker build -t graphgen .
  2. Run the Docker container
     docker run -p 7860:7860 graphgen

πŸ—οΈ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

(Diagram: GraphGen workflow)

πŸ€ Acknowledgements

  • SiliconFlow Abundant LLM API, some models are free
  • LightRAG Simple and efficient graph retrieval solution
  • ROGRAG A robustly optimized GraphRAG framework
  • DB-GPT An AI native data app development framework

πŸ“š Citation

If you find this repository useful, please consider citing our work:

@misc{chen2025graphgenenhancingsupervisedfinetuning,
 title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation}, 
 author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
 year={2025},
 eprint={2505.20416},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2505.20416}, 
}

πŸ“œ License

This project is licensed under the Apache License 2.0.

πŸ“… Star History

(Star history chart)
