open-sciencelab/GraphGen

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

English | δΈ­ζ–‡

πŸ“ What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. For details, please see the paper and the best-practice guide.

Below are post-training results in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| --- | --- | --- | --- |
| Plant | SeedBench | 65.9 | 51.5 |
| Common Knowledge | CMMLU | 73.6 | 75.8 |
| Common Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |

GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. It also incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
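The ECE metric mentioned above can be illustrated with a short, self-contained sketch (our own simplification, not the project's implementation): partition the trainee model's confidence scores into bins, then weight each bin's |confidence βˆ’ accuracy| gap by the fraction of samples it holds.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: per-sample confidence scores in (0, 1].
    correct: per-sample 0/1 correctness labels.
    """
    ece = 0.0
    n = len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Samples whose confidence falls in the half-open bin (lo, hi]
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Weight the bin's calibration gap by its share of all samples
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model scores near zero; a high ECE signals knowledge gaps worth targeting with synthetic QA pairs.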

After data generation, you can use LLaMA-Factory or xtuner to fine-tune your LLMs.

πŸ“Œ Latest Updates

  • 2025εΉ΄10月23ζ—₯: We support VQA(Visual Question Answering) data generation now. Run script: bash scripts/generate/generate_vqa.sh.
  • 2025εΉ΄10月21ζ—₯: We support PDF as input format for data generation now via MinerU.
  • 2025εΉ΄09月29ζ—₯: We auto-update gradio demo on Hugging Face and ModelScope.
History
  • 2025εΉ΄08月14ζ—₯: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025εΉ΄07月31ζ—₯: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025εΉ΄04月21ζ—₯: We have released the initial version of GraphGen.

πŸš€ Quick Start

Try GraphGen through the Web demo or the backup Web entrance.

For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

  1. Install uv

    # If you hit network issues, try pipx or pip to install uv instead; see the uv docs for details
    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the repository

    git clone --depth=1 https://github.com/open-sciencelab/GraphGen
    cd GraphGen
  3. Create a new uv environment

     uv venv --python 3.10
  4. Configure the dependencies

    uv pip install -r requirements.txt

Run Gradio Demo

python -m webui.app

For hot-reload during development, run

PYTHONPATH=. gradio webui/app.py

(Screenshot: GraphGen web UI)

Run from PyPI

  1. Install GraphGen

    uv pip install graphg
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
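The six variables above can be read back in Python roughly like this (a minimal sketch; the helper name is ours, not part of the graphg CLI):

```python
import os

def model_config(role: str) -> dict:
    """Collect the MODEL / BASE_URL / API_KEY variables for one role,
    either "SYNTHESIZER" or "TRAINEE", from the environment set above."""
    return {
        "model": os.environ[f"{role}_MODEL"],
        "base_url": os.environ[f"{role}_BASE_URL"],
        "api_key": os.environ[f"{role}_API_KEY"],
    }
```

Keeping the two roles symmetric like this means a single helper covers both the synthesizer and the trainee model.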

Run from Source

  1. Configure the environment

    • Create an .env file in the root directory
      cp .env.example .env
    • Set the following environment variables:
      # Synthesizer is the model used to construct KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model used to train with the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  2. (Optional) Customize generation parameters in the graphgen/configs/ folder.

    Edit the corresponding YAML file, e.g.:

     # configs/cot_config.yaml
     input_file: resources/input_examples/jsonl_demo.jsonl
     output_data_type: cot
     tokenizer: cl100k_base
     # additional settings...
  3. Generate data

    Pick the desired format and run the matching script:

    | Format | Script to run | Notes |
    | --- | --- | --- |
    | cot | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
    | atomic | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
    | aggregated | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
    | multi-hop | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |
  4. Get the generated data

    ls cache/data/graphgen
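Once generation finishes, the output can be inspected with a few lines of Python (a sketch that assumes JSONL output files; the exact file names under cache/data/graphgen depend on the chosen format):

```python
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line, the usual JSONL layout."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

After loading, printing the keys of the first record shows which QA fields the chosen format produced.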

Run with Docker

  1. Build the Docker image
    docker build -t graphgen .
  2. Run the Docker container
     docker run -p 7860:7860 graphgen

πŸ—οΈ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

(Diagram: GraphGen workflow)

πŸ€ Acknowledgements

  • SiliconFlow Abundant LLM API, some models are free
  • LightRAG Simple and efficient graph retrieval solution
  • ROGRAG A robustly optimized GraphRAG framework
  • DB-GPT An AI native data app development framework

πŸ“š Citation

If you find this repository useful, please consider citing our work:

@misc{chen2025graphgenenhancingsupervisedfinetuning,
 title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation}, 
 author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
 year={2025},
 eprint={2505.20416},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2505.20416}, 
}

πŸ“œ License

This project is licensed under the Apache License 2.0.

πŸ“… Star History

(Star history chart)
