Hugging Face Dataset · Project Page · arXiv · GitHub · License: MIT
This is the official repository for our paper
π "In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding"
π Project site
ChartScope lets you automatically generate synthetic chart data with Qwen3 and easily download the ChartDQA benchmark. Stay tuned for more updates!
This repo offers an automated, efficient pipeline powered by a text-only LLM. With a single command, you can generate:
- Chart images
- Raw JSON data
- Question-Answer pairs
- Python scripts
- Background stories
- July 18, 2025: Data-generation pipeline & ChartDQA benchmark are now released!
- OS: Ubuntu 24.04.2 LTS
- CUDA: 12.6
- GPUs: Tested on NVIDIA L40 or NVIDIA H100
Requires: Python ≥ 3.10
```bash
# Core deps (pathlib, subprocess, and threading are in the standard
# library and do not need to be installed)
pip install openai tqdm joblib
pip install -U "huggingface_hub[cli]"

# PyTorch for CUDA 12.6
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
    --index-url https://download.pytorch.org/whl/cu126

# Flash attention
pip install flash-attn==2.7.3

# vLLM & transformers
pip install vllm==0.9.0.1 transformers==4.51.3
pip install accelerate einops
```
```bash
mkdir model-weights
huggingface-cli download Qwen/Qwen3-32B --local-dir model-weights/Qwen3-32B
```
Note: The data in our paper were generated with OpenAI GPT models. This pipeline uses the open-source Qwen3 for public use. You can switch from Qwen3 to GPT by simply setting
`GPT_DEPLOY_NAME="gpt-o4-mini"` in all files under scripts_api.
```bash
bash launch.sh
# OR
vllm serve \
    model-weights/Qwen3-32B/ \
    --tensor-parallel-size 4 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072
```
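Once the server is up, the generation scripts talk to it through its OpenAI-compatible API. The sketch below shows what such a request could look like; the endpoint URL, prompt wording, and payload fields are illustrative assumptions, not taken from the repo's scripts.

```python
"""Minimal sketch of querying the local vLLM server (assumptions noted above)."""


def build_request(chart_type: str, model: str = "model-weights/Qwen3-32B/") -> dict:
    """Assemble a hypothetical chat-completion payload asking for chart data."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You generate synthetic chart data as JSON."},
            {"role": "user", "content": f"Create one {chart_type} dataset with a short background story."},
        ],
        "temperature": 0.7,
    }


# Sending the request (requires `pip install openai` and a running server):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# reply = client.chat.completions.create(**build_request("Area_Chart"))
# print(reply.choices[0].message.content)
```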
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_json_template.py
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_json_data_and_qa.py
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_py_script.py
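The generated Python scripts are ordinary plotting programs, so a driver can execute each one in a subprocess and skip any that hang or crash. A minimal sketch of that pattern, where the timeout value is an assumption rather than the repo's actual setting:

```python
"""Run one generated plotting script in a subprocess (sketch; timeout is assumed)."""
import subprocess
import sys


def run_script(path: str, timeout: int = 60) -> bool:
    """Execute a generated script and report whether it exited cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,  # guard against scripts that never terminate
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```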
You can run this in parallel with step 4.
python3 tools/data/check_json_qa_format.py
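The format check walks the generated QA files and flags malformed ones. The sketch below assumes a schema of a top-level list of `{"question", "answer"}` dicts; the real checker in tools/data/check_json_qa_format.py may use a different schema.

```python
"""Validate a QA annotation file (sketch under an assumed schema)."""
import json
from pathlib import Path


def check_qa_file(path: Path) -> list:
    """Return a list of problems found; an empty list means the file looks valid."""
    problems = []
    try:
        items = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(items, list):
        return [f"{path}: expected a top-level list"]
    for i, item in enumerate(items):
        for key in ("question", "answer"):
            if not isinstance(item, dict) or key not in item:
                problems.append(f"{path}[{i}]: missing '{key}'")
    return problems
```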
python3 tools/data/merge_folders.py
Adjust the number of workers in the script to match your machine.
python3 tools/data/generate_chart_image.py
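Rendering is embarrassingly parallel, which is why the worker count above is worth tuning. A minimal sketch of the fan-out pattern using the standard library; `render_chart` here is a placeholder stub, where the real script would save a matplotlib figure per JSON file:

```python
"""Fan chart rendering out over a worker pool (sketch; renderer is a stub)."""
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_chart(json_path: Path, out_dir: Path) -> Path:
    """Placeholder renderer: writes a stub file where a PNG would go."""
    out = out_dir / (json_path.stem + ".png")
    out.write_bytes(b"")  # a real renderer would save a matplotlib figure here
    return out


def render_all(json_files, out_dir: Path, workers: int = 8):
    """Render every JSON file; tune `workers` to match your machine."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: render_chart(p, out_dir), json_files))
```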
| Task | Time |
|---|---|
| Template generation | 4.3 min per chart type |
| JSON data & QA generation | 2.2 min per pair |
| Python script generation | 10 min per script |
We provide two annotation formatsβJSON and JSONLβwith identical QA pairs.
Use test.json for full evaluation and test_small.json for a quick run on 1,000 sampled QA pairs.
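If you want a quick-run subset of a different size than test_small.json, you can sample one yourself. This sketch assumes test.json is a top-level list of QA records; the seed and sampling method are illustrative, not the ones used to build the official test_small.json.

```python
"""Sample a quick-run subset from the full benchmark file (assumed schema)."""
import json
import random
from pathlib import Path


def sample_subset(full_path: Path, k: int = 1000, seed: int = 0) -> list:
    """Deterministically sample up to k QA records from the full file."""
    items = json.loads(full_path.read_text())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(items, min(k, len(items)))
```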
```
ChartDQA
├── data
│   ├── Area_Chart/
│   │   ├── chart/
│   │   │   ├── 000000_script_matplotlib_0.png
│   │   │   └── ...
│   │   ├── csv/
│   │   │   ├── 000000.csv
│   │   │   └── ...
│   │   ├── json/
│   │   │   ├── 000000.json
│   │   │   └── ...
│   │   └── qa/
│   │       ├── 000000.json
│   │       └── ...
│   ├── Bar_Chart/
│   ├── Box_Plot/
│   └── ...
├── test.json
├── test.jsonl
├── test_small.json
└── test_small.jsonl
```
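Given this layout, each chart image can be paired with its QA annotations by the shared numeric prefix. A sketch based on the tree above, where parsing the leading six-digit ID out of the PNG file name is an assumption:

```python
"""Pair chart images with QA files by shared sample ID (layout from the tree above)."""
from pathlib import Path


def collect_samples(chart_type_dir: Path) -> dict:
    """Map sample IDs to chart/QA paths within one chart-type folder."""
    samples = {}
    for png in (chart_type_dir / "chart").glob("*.png"):
        sample_id = png.name.split("_")[0]  # e.g. "000000" (assumed naming)
        qa = chart_type_dir / "qa" / f"{sample_id}.json"
        if qa.exists():  # keep only samples with both an image and annotations
            samples[sample_id] = {"chart": png, "qa": qa}
    return samples
```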
If you find ChartScope useful, please cite:
```bibtex
@inproceedings{fan2025chartscope,
  title={On pre-training of multimodal language models customized for chart understanding},
  author={Fan, Wan-Cyuan and Chen, Yen-Chun and Liu, Mengchen and Jacobson, Alexander and Yuan, Lu and Sigal, Leonid},
  booktitle={NeurIPS Workshop on Adaptive Foundation Models},
  year={2024}
}
```
This project is licensed under the MIT License.