Do VLMs really understand visualized text as well as pure text?
A systematic benchmark spanning multimodal perception → reasoning → unimodal knowledge.
Figure 1: Motivation of VISTA-Bench.
We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding. It evaluates visualized text understanding by contrasting pure-text and visualized-text (VT) questions under controlled rendering conditions.
- Size: 1,500 instances
- Composition: predominantly multiple-choice questions (MCQ), with a small portion of open-ended queries
- Task taxonomy (4 primary categories; summarized in the snippet below):
  - Unimodal Knowledge: 500
  - Multimodal Knowledge: 400
  - Multimodal Perception: 300
  - Multimodal Reasoning: 300
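A minimal Python sketch of the composition above (the dictionary keys are illustrative labels, not an official field schema):

```python
# Category sizes of VISTA-Bench, taken from the taxonomy list above.
# The key names are illustrative labels, not an official schema.
VISTA_BENCH_TAXONOMY = {
    "Unimodal Knowledge": 500,
    "Multimodal Knowledge": 400,
    "Multimodal Perception": 300,
    "Multimodal Reasoning": 300,
}
assert sum(VISTA_BENCH_TAXONOMY.values()) == 1500  # matches the 1,500-instance total
```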
Figure 2: Ability dimensions and task taxonomy of VISTA-Bench.
Figure 3: A representative multimodal perception case under the visualized-text interface.
VISTA-Bench/
├─ assets/figures/ # figures used in this README
├─ images/ # original images (for multimodal instances)
├─ questions/ # rendered question/option images (VT setting)
├─ VLMEvalKit/ # evaluation toolkit
├─ VISTA-Bench.tsv # dataset index (currently identical to the VT variant)
└─ VISTA-Bench-VT.tsv # dataset index (currently identical; kept for compatibility)
Note: `VISTA-Bench.tsv` and `VISTA-Bench-VT.tsv` are currently identical; we keep both filenames for compatibility and will refine the organization later.
- `images/`: original images used by multimodal instances
- `questions/`: rendered question/option images for the visualized-text (VT) setting
- `*.tsv`: dataset metadata and file paths used for evaluation (a quick loading sketch is shown below)
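A quick way to sanity-check the index is to load it with pandas. This is only a sketch: the actual column set is whatever the released TSV contains, and you may need an explicit `encoding=` if the raw file is not UTF-8 (the conversion step below re-encodes it):

```python
import pandas as pd

# Peek at the dataset index. Adjust the path to your local checkout; the
# column names are defined by the released TSV, not by this snippet.
df = pd.read_csv("VISTA-Bench/VISTA-Bench.tsv", sep="\t")
print(len(df))            # expected: 1,500 instances
print(list(df.columns))   # question/option fields and image-path fields
print(df.head(3))
```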
We evaluate VISTA-Bench with the bundled `VLMEvalKit/`.
Before running evaluation, we recommend converting the TSV file(s) into a normalized format with absolute image paths.
Helper script:
VISTA-Bench/VLMEvalKit/utils/convert_data_file.py
What it does (roughly sketched in Python after this list):
- converts the TSV encoding to UTF-8
- normalizes path separators (`\` → `/`)
- renames `options_A/B/C/D` → `A/B/C/D` when needed
- converts `image_path` and `question_image_path` into a bracket-style multi-path string
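The snippet below is only a rough Python sketch of these steps, not the helper itself: the exact bracket-style format and the name of the merged column (`image`) are assumptions here, so treat `convert_data_file.py` as the authoritative reference.

```python
import pandas as pd

# Rough sketch of what convert_data_file.py does -- NOT the helper itself.
# The bracket-style multi-path format (a Python-list-like string) and the
# merged column name `image` are assumptions; consult the script directly.
IMAGE_PREFIX = "/ABS/PATH/TO/VISTA-Bench"  # plays the role of --image-prefix

df = pd.read_csv("VISTA-Bench/VISTA-Bench.tsv", sep="\t")  # add encoding=... if needed

# Normalize separators and resolve relative paths against the dataset root.
for col in ("image_path", "question_image_path"):
    if col in df.columns:
        df[col] = (
            df[col].astype(str).str.replace("\\", "/", regex=False)
            .apply(lambda p: p if p.startswith("/") else f"{IMAGE_PREFIX}/{p}")
        )

# Rename option columns when needed (options_A/B/C/D -> A/B/C/D).
df = df.rename(columns={f"options_{c}": c for c in "ABCD"})

# Combine the two path columns into one bracket-style multi-path string.
if {"image_path", "question_image_path"} <= set(df.columns):
    df["image"] = df.apply(
        lambda r: str([r["image_path"], r["question_image_path"]]), axis=1
    )

df.to_csv("VISTA-Bench_norm_sketch.tsv", sep="\t", index=False, encoding="utf-8")
```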
Example:

```bash
python VISTA-Bench/VLMEvalKit/utils/convert_data_file.py \
  --in VISTA-Bench/VISTA-Bench.tsv \
  --out VISTA-Bench/VISTA-Bench_norm.tsv \
  --image-prefix /ABS/PATH/TO/VISTA-Bench
```
- `--in`: input TSV path (the original dataset TSV to be converted)
- `--out`: output TSV path (the converted/normalized TSV produced by this script)
- `--image-prefix`: the dataset root directory containing `images/`, `questions/`, and the TSV files (used to resolve relative paths)
After conversion, rename the TSV column header `image` to `image-1` to avoid an `AssertionError` in some VLMEvalKit setups.
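If you prefer to do the rename in code, a minimal pandas sketch (assuming the converted file from the example above) is:

```python
import pandas as pd

# Rename the `image` column to `image-1` in the converted TSV.
# The path below is the output of the conversion example; adjust as needed.
path = "VISTA-Bench/VISTA-Bench_norm.tsv"
df = pd.read_csv(path, sep="\t")
df = df.rename(columns={"image": "image-1"})
df.to_csv(path, sep="\t", index=False)
```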
Pure-text:

```bash
python /VISTA-Bench/VLMEvalKit/run.py \
  --data VISTA-Bench_norm \
  --model llava_v1.5_7b \
  --verbose
```
Visualized-text (VT):

```bash
python /VISTA-Bench/VLMEvalKit/run.py \
  --data VISTA-Bench-VT \
  --model llava_v1.5_7b \
  --verbose
```
- `--data`: the dataset name corresponding to your converted TSV (e.g., `VISTA-Bench_norm`). The VT split (`VISTA-Bench-VT`) should also be converted and can then be run directly under this name because it is registered in `VLMEvalKit/vlmeval/dataset/image_mm_mcq.py` via `DATASET_URL`/`MD5`.
- `--model`: the model name defined in `VLMEvalKit/vlmeval/config.py` (make sure the corresponding weights are available in your environment; see the registry-listing sketch below).
- `--verbose`: print detailed logs during evaluation.
- Outputs: the final report includes `overall` results and an `l1-categories` breakdown.
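Before launching a run, you can list the registered model names. The snippet below assumes the bundled VLMEvalKit is importable (e.g., installed with `pip install -e VISTA-Bench/VLMEvalKit`) and follows the usual VLMEvalKit convention of exposing a `supported_VLM` registry in `vlmeval/config.py`:

```python
# List model names registered in VLMEvalKit before calling run.py.
# Assumes the bundled VLMEvalKit is installed and exposes the usual
# `supported_VLM` registry in vlmeval/config.py.
from vlmeval.config import supported_VLM

print(len(supported_VLM), "models registered")
print([name for name in supported_VLM if "llava" in name.lower()][:10])
```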
**Modality Comparison (VT vs. Text)**

| Model | Multimodal Perception (VT) | Multimodal Perception (Text) | Multimodal Reasoning (VT) | Multimodal Reasoning (Text) | Multimodal Knowledge (VT) | Multimodal Knowledge (Text) | Unimodal Knowledge (VT) | Unimodal Knowledge (Text) | Overall (VT) | Overall (Text) | ↓ Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ▼ Vision-Language Models (2B) | |||||||||||
| DeepSeek-VL2-Tiny | 44.3 | 64.0 | 31.3 | 43.7 | 27.5 | 28.8 | 27.6 | 41.8 | 31.7 | 43.1 | ↓-11.4 |
| Qwen3-VL-2B-Instruct | 51.3 | 69.0 | 32.7 | 49.7 | 17.5 | 24.0 | 37.2 | 52.0 | 33.9 | 47.5 | ↓-13.6 |
| Ovis2-2B | 58.3 | 66.7 | 39.7 | 52.3 | 27.0 | 31.2 | 36.0 | 50.2 | 38.8 | 48.9 | ↓-10.1 |
| NEO-2B-SFT | 40.0 | 68.3 | 31.3 | 49.3 | 25.3 | 37.3 | 29.4 | 53.4 | 30.8 | 51.3 | ↓-20.5 |
| Qwen2.5-VL-3B-Instruct | 65.0 | 67.7 | 43.3 | 54.3 | 32.5 | 35.8 | 54.8 | 56.6 | 48.6 | 52.8 | ↓-4.2 |
| InternVL3.5-2B | 56.0 | 66.3 | 39.3 | 50.3 | 30.5 | 39.8 | 44.8 | 57.0 | 42.1 | 52.9 | ↓-10.8 |
| SAIL-VL2-2B | 65.3 | 69.7 | 47.3 | 57.7 | 32.2 | 39.0 | 43.6 | 54.8 | 45.7 | 54.1 | ↓-8.4 |
| Ovis2.5-2B | 66.3 | 69.7 | 51.7 | 58.0 | 28.5 | 39.5 | 51.8 | 60.0 | 48.5 | 56.1 | ↓-7.6 |
| ▼ Vision-Language Models (8B) | |||||||||||
| LLaVA-1.5-7B | 33.0 | 58.7 | 27.3 | 44.3 | 26.5 | 27.3 | 24.0 | 48.8 | 27.1 | 44.1 | ↓-17.0 |
| LLaVA-OneVision-7B | 40.3 | 66.0 | 27.0 | 56.3 | 20.0 | 35.3 | 27.0 | 58.6 | 27.8 | 53.4 | ↓-25.6 |
| Qwen2.5-VL-7B-Instruct | 65.7 | 65.3 | 52.7 | 53.0 | 27.0 | 35.3 | 62.4 | 62.0 | 51.7 | 53.7 | ↓-2.0 |
| MiniCPM-V-4\_5 | 64.3 | 71.6 | 45.7 | 60.3 | 31.5 | 36.0 | 50.4 | 55.0 | 47.2 | 54.3 | ↓-7.1 |
| Qwen3-VL-8B-Instruct | 65.3 | 67.3 | 49.0 | 49.3 | 37.5 | 44.5 | 57.8 | 68.2 | 52.1 | 57.9 | ↓-5.8 |
| Ovis2-8B | 66.7 | 71.0 | 47.7 | 60.0 | 29.0 | 39.8 | 50.8 | 65.4 | 47.5 | 58.6 | ↓-11.1 |
| InternVL3.5-8B | 61.3 | 64.3 | 45.7 | 52.3 | 35.3 | 44.5 | 57.6 | 71.2 | 50.0 | 58.9 | ↓-8.9 |
| MiMo-VL-7B-RL | 70.3 | 69.0 | 58.3 | 61.3 | 40.5 | 38.3 | 69.0 | 68.8 | 59.5 | 59.2 | ↑+0.3 |
| NEO-9B-SFT | 32.7 | 69.0 | 29.0 | 58.0 | 24.8 | 40.5 | 28.6 | 69.2 | 28.5 | 59.3 | ↓-30.8 |
| LLaVA-OneVision-1.5-8B | 62.7 | 68.7 | 46.3 | 59.0 | 33.5 | 42.5 | 57.4 | 67.8 | 49.9 | 59.5 | ↓-9.6 |
| MiMo-VL-7B-SFT | 68.3 | 69.3 | 61.3 | 62.7 | 40.5 | 41.0 | 70.6 | 72.0 | 60.3 | 61.3 | ↓-1.0 |
| SAIL-VL2-8B | 68.7 | 70.0 | 54.3 | 60.7 | 37.8 | 44.8 | 58.0 | 71.0 | 54.0 | 61.7 | ↓-7.7 |
| Ovis2.5-9B | 68.3 | 69.0 | 56.3 | 65.3 | 38.8 | 52.0 | 66.0 | 73.8 | 57.3 | 65.3 | ↓-8.0 |
| GLM-4.1V-9B-Thinking | 70.7 | 71.3 | 58.7 | 61.7 | 50.0 | 52.5 | 73.8 | 75.8 | 63.8 | 65.9 | ↓-2.1 |
| ▼ Vision-Language Models (30B-A3B) | |||||||||||
| Kimi-VL-A3B-Thinking | 69.0 | 71.0 | 49.7 | 61.7 | 30.8 | 33.5 | 52.6 | 66.4 | 49.4 | 57.6 | ↓-8.2 |
| Qwen3-VL-30B-A3B-Instruct | 64.3 | 71.0 | 51.0 | 60.0 | 33.0 | 44.0 | 54.2 | 71.0 | 49.9 | 61.6 | ↓-11.7 |
| InternVL3.5-30B-A3B | 64.3 | 70.3 | 50.3 | 61.7 | 40.8 | 50.3 | 61.8 | 75.2 | 54.4 | 64.9 | ↓-10.5 |
Caption: Comparison of different VLMs on our benchmark. Results are reported under Visualized Text (VT) and Text inputs for each metric. The best result per column is bolded. The ↓Gap column denotes the overall performance drop when switching from Text to Visualized Text. All metrics are reported as percentages (%).
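As a reading aid, the ↓Gap column is simply Overall(VT) minus Overall(Text); the snippet below reproduces two entries using values copied from the rows above:

```python
# Gap = Overall(VT) - Overall(Text): negative values mean the model loses
# accuracy when the same question is rendered as an image.
rows = {
    "DeepSeek-VL2-Tiny": (31.7, 43.1),  # (Overall VT, Overall Text)
    "MiMo-VL-7B-RL": (59.5, 59.2),
}
for model, (vt, text) in rows.items():
    print(f"{model}: gap = {vt - text:+.1f}")
# DeepSeek-VL2-Tiny: gap = -11.4
# MiMo-VL-7B-RL: gap = +0.3
```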
- Qing'an Liu: 2223884741@mail.dlut.edu.cn
- Juntong Feng: 2253762636@mail.dlut.edu.cn
To be added.
To be added.