Hugging Face Dataset · Project Page · arXiv · GitHub · License: MIT
This is the official repository for our paper
π "In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding"
π Project site
ChartScope lets you automatically generate synthetic chart data with Qwen3 and easily download the ChartDQA benchmark. Stay tuned for more updates!
This repo offers an automated, efficient pipeline powered by a text-only LLM. With a single command, you can generate:
- Chart images
- Raw JSON data
- Question-Answer pairs
- Python scripts
- Background stories
- July 18, 2025: Data-generation pipeline & ChartDQA benchmark are now released!
- OS: Ubuntu 24.04.2 LTS
- CUDA: 12.6
- GPUs: Tested on NVIDIA L40 or NVIDIA H100
Requires: Python ≥ 3.10
```bash
# Core deps (pathlib, subprocess, and threading are in the standard
# library and do not need to be installed)
pip install openai tqdm joblib
pip install -U "huggingface_hub[cli]"

# PyTorch for CUDA 12.6
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
    --index-url https://download.pytorch.org/whl/cu126

# Flash attention
pip install flash-attn==2.7.3

# vLLM & transformers
pip install vllm==0.9.0.1 transformers==4.51.3
pip install accelerate einops
```
```bash
mkdir model-weights
huggingface-cli download Qwen/Qwen3-32B --local-dir model-weights/Qwen3-32B
```
Note: The data in our paper were generated with OpenAI GPT models. This pipeline uses the open-source Qwen3 for public use. You can switch from Qwen3 to GPT by simply setting
`GPT_DEPLOY_NAME="gpt-o4-mini"` in all files under scripts_api.
```bash
bash launch.sh
# OR
vllm serve \
    model-weights/Qwen3-32B/ \
    --tensor-parallel-size 4 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072
```
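Once the server is up, the generation scripts talk to it through its OpenAI-compatible API. The sketch below shows what such a request could look like; the endpoint URL, prompt wording, and payload fields are illustrative assumptions, not taken from the repo's scripts.

```python
"""Minimal sketch of querying the local vLLM server (assumptions noted above)."""


def build_request(chart_type: str, model: str = "model-weights/Qwen3-32B/") -> dict:
    """Assemble a hypothetical chat-completion payload asking for chart data."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You generate synthetic chart data as JSON."},
            {"role": "user", "content": f"Create one {chart_type} dataset with a short background story."},
        ],
        "temperature": 0.7,
    }


# Sending the request (requires `pip install openai` and a running server):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# reply = client.chat.completions.create(**build_request("Area_Chart"))
# print(reply.choices[0].message.content)
```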
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_json_template.py
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_json_data_and_qa.py
If you are using GPT, first check the model version configured in the script.
python3 scripts_api/generate_py_script.py
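The generated Python scripts are ordinary plotting programs, so a driver can execute each one in a subprocess and skip any that hang or crash. A minimal sketch of that pattern, where the timeout value is an assumption rather than the repo's actual setting:

```python
"""Run one generated plotting script in a subprocess (sketch; timeout is assumed)."""
import subprocess
import sys


def run_script(path: str, timeout: int = 60) -> bool:
    """Execute a generated script and report whether it exited cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,  # guard against scripts that never terminate
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```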
You can run this in parallel with step 4.
python3 tools/data/check_json_qa_format.py
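The format check walks the generated QA files and flags malformed ones. The sketch below assumes a schema of a top-level list of `{"question", "answer"}` dicts; the real checker in tools/data/check_json_qa_format.py may use a different schema.

```python
"""Validate a QA annotation file (sketch under an assumed schema)."""
import json
from pathlib import Path


def check_qa_file(path: Path) -> list:
    """Return a list of problems found; an empty list means the file looks valid."""
    problems = []
    try:
        items = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(items, list):
        return [f"{path}: expected a top-level list"]
    for i, item in enumerate(items):
        for key in ("question", "answer"):
            if not isinstance(item, dict) or key not in item:
                problems.append(f"{path}[{i}]: missing '{key}'")
    return problems
```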
python3 tools/data/merge_folders.py
Adjust the number of workers in the script to match your machine.
python3 tools/data/generate_chart_image.py
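Rendering is embarrassingly parallel, which is why the worker count above is worth tuning. A minimal sketch of the fan-out pattern using the standard library; `render_chart` here is a placeholder stub, where the real script would save a matplotlib figure per JSON file:

```python
"""Fan chart rendering out over a worker pool (sketch; renderer is a stub)."""
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_chart(json_path: Path, out_dir: Path) -> Path:
    """Placeholder renderer: writes a stub file where a PNG would go."""
    out = out_dir / (json_path.stem + ".png")
    out.write_bytes(b"")  # a real renderer would save a matplotlib figure here
    return out


def render_all(json_files, out_dir: Path, workers: int = 8):
    """Render every JSON file; tune `workers` to match your machine."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: render_chart(p, out_dir), json_files))
```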
| Task | Time |
|---|---|
| Template generation | 4.3 min per chart type |
| JSON data & QA generation | 2.2 min per pair |
| Python script generation | 10 min per script |
We provide two annotation formatsβJSON and JSONLβwith identical QA pairs.
Use test.json for full evaluation and test_small.json for a quick run on 1,000 sampled QA pairs.
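If you want a quick-run subset of a different size than test_small.json, you can sample one yourself. This sketch assumes test.json is a top-level list of QA records; the seed and sampling method are illustrative, not the ones used to build the official test_small.json.

```python
"""Sample a quick-run subset from the full benchmark file (assumed schema)."""
import json
import random
from pathlib import Path


def sample_subset(full_path: Path, k: int = 1000, seed: int = 0) -> list:
    """Deterministically sample up to k QA records from the full file."""
    items = json.loads(full_path.read_text())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(items, min(k, len(items)))
```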
```
ChartDQA
├── data
│   ├── Area_Chart/
│   │   ├── chart/
│   │   │   ├── 000000_script_matplotlib_0.png
│   │   │   └── ...
│   │   ├── csv/
│   │   │   ├── 000000.csv
│   │   │   └── ...
│   │   ├── json/
│   │   │   ├── 000000.json
│   │   │   └── ...
│   │   └── qa/
│   │       ├── 000000.json
│   │       └── ...
│   ├── Bar_Chart/
│   ├── Box_Plot/
│   └── ...
├── test.json
├── test.jsonl
├── test_small.json
└── test_small.jsonl
```
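Given this layout, each chart image can be paired with its QA annotations by the shared numeric prefix. A sketch based on the tree above, where parsing the leading six-digit ID out of the PNG file name is an assumption:

```python
"""Pair chart images with QA files by shared sample ID (layout from the tree above)."""
from pathlib import Path


def collect_samples(chart_type_dir: Path) -> dict:
    """Map sample IDs to chart/QA paths within one chart-type folder."""
    samples = {}
    for png in (chart_type_dir / "chart").glob("*.png"):
        sample_id = png.name.split("_")[0]  # e.g. "000000" (assumed naming)
        qa = chart_type_dir / "qa" / f"{sample_id}.json"
        if qa.exists():  # keep only samples with both an image and annotations
            samples[sample_id] = {"chart": png, "qa": qa}
    return samples
```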
If you find ChartScope useful, please cite:
```bibtex
@inproceedings{fan2025chartscope,
  title={On pre-training of multimodal language models customized for chart understanding},
  author={Fan, Wan-Cyuan and Chen, Yen-Chun and Liu, Mengchen and Jacobson, Alexander and Yuan, Lu and Sigal, Leonid},
  booktitle={NeurIPS Workshop on Adaptive Foundation Models},
  year={2024}
}
```
This project is licensed under the MIT License.