
VISTA-Bench


Do VLMs really understand visualized text as well as pure text?
A systematic benchmark spanning multimodal perception → reasoning → unimodal knowledge.


Figure 1: Motivation of VISTA-Bench.


Overview

We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding. It evaluates visualized text understanding by contrasting pure-text and visualized-text (VT) questions under controlled rendering conditions.


Dataset at a glance

  • Size: 1,500 instances
  • Composition: predominantly multiple-choice questions (MCQ), with a small portion of open-ended queries
  • Task taxonomy (4 primary categories):
    • Unimodal Knowledge: 500
    • Multimodal Knowledge: 400
    • Multimodal Perception: 300
    • Multimodal Reasoning: 300
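
To sanity-check these counts against your local copy, a minimal sketch like the following can be used. It assumes the TSV is readable with pandas and that the primary category is stored in a column named l1-category (an assumption based on the "l1-categories" breakdown mentioned in the evaluation output below; check df.columns if your copy differs):

import pandas as pd

# Load the dataset index (tab-separated values).
df = pd.read_csv("VISTA-Bench/VISTA-Bench.tsv", sep="\t")

# Total number of instances (expected: 1,500).
print("instances:", len(df))

# Per-category counts (expected: 500 / 400 / 300 / 300).
# NOTE: "l1-category" is an assumed column name.
print(df["l1-category"].value_counts())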


Figure 2: Ability dimensions and task taxonomy of VISTA-Bench.


Qualitative example

Figure 3: Representative cases under the visualized-text interface (a unimodal knowledge case and a multimodal reasoning case).


Repository structure

VISTA-Bench/
├─ assets/figures/ # figures used in this README
├─ images/ # original images (for multimodal instances)
├─ questions/ # rendered question/option images (VT setting)
├─ VLMEvalKit/ # evaluation toolkit
├─ VISTA-Bench.tsv # dataset index (currently identical to the VT variant)
└─ VISTA-Bench-VT.tsv # dataset index (currently identical; kept for compatibility)

Note: VISTA-Bench.tsv and VISTA-Bench-VT.tsv are currently identical; we keep both filenames for compatibility and will refine the organization later.


Data format

  • images/: original images used by multimodal instances
  • questions/: rendered question/option images for the visualized-text (VT) setting
  • *.tsv: dataset metadata and file paths used for evaluation
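
As a quick orientation to how these pieces fit together, the sketch below loads the index and resolves one instance's file references against the dataset root. The column names image_path and question_image_path are taken from the conversion step described below; verify them against your TSV before relying on them:

import os
import pandas as pd

ROOT = "/ABS/PATH/TO/VISTA-Bench"  # dataset root containing images/, questions/, and the TSVs

df = pd.read_csv(os.path.join(ROOT, "VISTA-Bench.tsv"), sep="\t")

# Resolve the first instance's relative file references against the root.
# "image_path" / "question_image_path" are assumed column names; see df.columns.
row = df.iloc[0]
for col in ("image_path", "question_image_path"):
    if col in df.columns and isinstance(row[col], str):
        print(col, "->", os.path.join(ROOT, row[col].replace("\\", "/")))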

Evaluation (VLMEvalKit)

We evaluate VISTA-Bench with the bundled VLMEvalKit/ toolkit.
Before running evaluation, we recommend converting the TSV file(s) into a normalized format with absolute image paths.

1) Convert TSV to normalized paths

Helper script:

  • VISTA-Bench/VLMEvalKit/utils/convert_data_file.py

What it does:

  • converts the TSV encoding to UTF-8
  • normalizes path separators (\ → /)
  • renames options_A/B/C/D → A/B/C/D when needed
  • converts image_path and question_image_path into a bracket-style multi-path string
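
For orientation, the transformations listed above boil down to something like the simplified sketch below. It is not a substitute for convert_data_file.py: the option-column prefix, the path columns, and the exact bracket-style output format are assumptions based on the bullets above.

import os
import pandas as pd

def normalize_tsv(in_path: str, out_path: str, image_prefix: str) -> None:
    # Read the original TSV and re-emit it as UTF-8 at the end.
    df = pd.read_csv(in_path, sep="\t", encoding="utf-8", encoding_errors="replace")

    # Rename prefixed option columns (options_A -> A, ..., options_D -> D) if present.
    df = df.rename(columns={f"options_{c}": c for c in "ABCD"})

    def to_multipath(row: pd.Series) -> str:
        # Normalize separators (\ -> /), prepend the dataset root, and join the
        # available path columns into a bracket-style multi-path string.
        # The exact string format expected downstream is an assumption.
        paths = []
        for col in ("image_path", "question_image_path"):
            val = row.get(col)
            if isinstance(val, str) and val:
                paths.append(os.path.join(image_prefix, val.replace("\\", "/")))
        return "[" + ", ".join(repr(p) for p in paths) + "]"

    df["image"] = df.apply(to_multipath, axis=1)
    df.to_csv(out_path, sep="\t", index=False, encoding="utf-8")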

Example:

python VISTA-Bench/VLMEvalKit/utils/convert_data_file.py \
 --in VISTA-Bench/VISTA-Bench.tsv \
 --out VISTA-Bench/VISTA-Bench_norm.tsv \
 --image-prefix /ABS/PATH/TO/VISTA-Bench

  • --in: input TSV path (the original dataset TSV to be converted)
  • --out: output TSV path (the converted/normalized TSV produced by this script)
  • --image-prefix: the dataset root directory where images/, questions/, and the TSV files are located (used to resolve relative paths)

After conversion, rename the TSV column header image to image-1 to avoid an AssertionError in some VLMEvalKit setups.
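
If you prefer to do that rename programmatically instead of editing the header by hand, a one-off pandas snippet is enough (the file name below assumes the --out path used in the example above):

import pandas as pd

path = "VISTA-Bench/VISTA-Bench_norm.tsv"  # the converted TSV from step 1
df = pd.read_csv(path, sep="\t")
df = df.rename(columns={"image": "image-1"})  # avoid the AssertionError mentioned above
df.to_csv(path, sep="\t", index=False)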

2) Run evaluation

Pure-text:

python VISTA-Bench/VLMEvalKit/run.py \
 --data VISTA-Bench_norm \
 --model llava_v1.5_7b \
 --verbose

Visualized-text (VT):

python VISTA-Bench/VLMEvalKit/run.py \
 --data VISTA-Bench-VT \
 --model llava_v1.5_7b \
 --verbose

  • --data: the dataset name corresponding to your converted TSV (e.g., VISTA-Bench_norm). The VT split (VISTA-Bench-VT) should also be converted; it can then be run directly under this name because it is registered in VLMEvalKit/vlmeval/dataset/image_mm_mcq.py via DATASET_URL/MD5 (see the registration sketch after this list).
  • --model: the model name defined in VLMEvalKit/vlmeval/config.py (make sure the corresponding weights are available in your environment).
  • --verbose: print detailed logs during evaluation.
  • Outputs: the final report includes overall results and a per-l1-category breakdown.
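
For reference, registering a dataset name in VLMEvalKit generally means adding entries to the class-level DATASET_URL and DATASET_MD5 dictionaries of the corresponding dataset class. The sketch below shows what such entries might look like in VLMEvalKit/vlmeval/dataset/image_mm_mcq.py; whether a local absolute path is accepted in place of a URL, and whether the MD5 entry is required, depend on your VLMEvalKit version, so treat this as illustrative only:

# Illustrative sketch inside VLMEvalKit/vlmeval/dataset/image_mm_mcq.py.
DATASET_URL = {
    # ... existing entries ...
    # Local path or downloadable URL of the TSV for this dataset name.
    'VISTA-Bench-VT': '/ABS/PATH/TO/VISTA-Bench/VISTA-Bench-VT.tsv',
}

DATASET_MD5 = {
    # ... existing entries ...
    # md5sum of the TSV above, used for integrity checking.
    'VISTA-Bench-VT': '<md5-of-VISTA-Bench-VT.tsv>',
}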

Results

Modality Comparison (VT vs. Text)

| Model | Multimodal Perception (VT / Text) | Multimodal Reasoning (VT / Text) | Multimodal Knowledge (VT / Text) | Unimodal Knowledge (VT / Text) | Overall (VT / Text) | Gap (Text → VT) |
|---|---|---|---|---|---|---|
| ▼ Vision-Language Models (2B) | | | | | | |
| DeepSeek-VL2-Tiny | 44.3 / 64.0 | 31.3 / 43.7 | 27.5 / 28.8 | 27.6 / 41.8 | 31.7 / 43.1 | ↓-11.4 |
| Qwen3-VL-2B-Instruct | 51.3 / 69.0 | 32.7 / 49.7 | 17.5 / 24.0 | 37.2 / 52.0 | 33.9 / 47.5 | ↓-13.6 |
| Ovis2-2B | 58.3 / 66.7 | 39.7 / 52.3 | 27.0 / 31.2 | 36.0 / 50.2 | 38.8 / 48.9 | ↓-10.1 |
| NEO-2B-SFT | 40.0 / 68.3 | 31.3 / 49.3 | 25.3 / 37.3 | 29.4 / 53.4 | 30.8 / 51.3 | ↓-20.5 |
| Qwen2.5-VL-3B-Instruct | 65.0 / 67.7 | 43.3 / 54.3 | 32.5 / 35.8 | 54.8 / 56.6 | 48.6 / 52.8 | ↓-4.2 |
| InternVL3.5-2B | 56.0 / 66.3 | 39.3 / 50.3 | 30.5 / 39.8 | 44.8 / 57.0 | 42.1 / 52.9 | ↓-10.8 |
| SAIL-VL2-2B | 65.3 / 69.7 | 47.3 / 57.7 | 32.2 / 39.0 | 43.6 / 54.8 | 45.7 / 54.1 | ↓-8.4 |
| Ovis2.5-2B | 66.3 / 69.7 | 51.7 / 58.0 | 28.5 / 39.5 | 51.8 / 60.0 | 48.5 / 56.1 | ↓-7.6 |
| ▼ Vision-Language Models (8B) | | | | | | |
| LLaVA-1.5-7B | 33.0 / 58.7 | 27.3 / 44.3 | 26.5 / 27.3 | 24.0 / 48.8 | 27.1 / 44.1 | ↓-17.0 |
| LLaVA-OneVision-7B | 40.3 / 66.0 | 27.0 / 56.3 | 20.0 / 35.3 | 27.0 / 58.6 | 27.8 / 53.4 | ↓-25.6 |
| Qwen2.5-VL-7B-Instruct | 65.7 / 65.3 | 52.7 / 53.0 | 27.0 / 35.3 | 62.4 / 62.0 | 51.7 / 53.7 | ↓-2.0 |
| MiniCPM-V-4_5 | 64.3 / 71.6 | 45.7 / 60.3 | 31.5 / 36.0 | 50.4 / 55.0 | 47.2 / 54.3 | ↓-7.1 |
| Qwen3-VL-8B-Instruct | 65.3 / 67.3 | 49.0 / 49.3 | 37.5 / 44.5 | 57.8 / 68.2 | 52.1 / 57.9 | ↓-5.8 |
| Ovis2-8B | 66.7 / 71.0 | 47.7 / 60.0 | 29.0 / 39.8 | 50.8 / 65.4 | 47.5 / 58.6 | ↓-11.1 |
| InternVL3.5-8B | 61.3 / 64.3 | 45.7 / 52.3 | 35.3 / 44.5 | 57.6 / 71.2 | 50.0 / 58.9 | ↓-8.9 |
| MiMo-VL-7B-RL | 70.3 / 69.0 | 58.3 / 61.3 | 40.5 / 38.3 | 69.0 / 68.8 | 59.5 / 59.2 | ↑+0.3 |
| NEO-9B-SFT | 32.7 / 69.0 | 29.0 / 58.0 | 24.8 / 40.5 | 28.6 / 69.2 | 28.5 / 59.3 | ↓-30.8 |
| LLaVA-OneVision-1.5-8B | 62.7 / 68.7 | 46.3 / 59.0 | 33.5 / 42.5 | 57.4 / 67.8 | 49.9 / 59.5 | ↓-9.6 |
| MiMo-VL-7B-SFT | 68.3 / 69.3 | 61.3 / 62.7 | 40.5 / 41.0 | 70.6 / 72.0 | 60.3 / 61.3 | ↓-1.0 |
| SAIL-VL2-8B | 68.7 / 70.0 | 54.3 / 60.7 | 37.8 / 44.8 | 58.0 / 71.0 | 54.0 / 61.7 | ↓-7.7 |
| Ovis2.5-9B | 68.3 / 69.0 | 56.3 / 65.3 | 38.8 / 52.0 | 66.0 / 73.8 | 57.3 / 65.3 | ↓-8.0 |
| GLM-4.1V-9B-Thinking | 70.7 / 71.3 | 58.7 / 61.7 | 50.0 / 52.5 | 73.8 / 75.8 | 63.8 / 65.9 | ↓-2.1 |
| ▼ Vision-Language Models (30B-A3B) | | | | | | |
| Kimi-VL-A3B-Thinking | 69.0 / 71.0 | 49.7 / 61.7 | 30.8 / 33.5 | 52.6 / 66.4 | 49.4 / 57.6 | ↓-8.2 |
| Qwen3-VL-30B-A3B-Instruct | 64.3 / 71.0 | 51.0 / 60.0 | 33.0 / 44.0 | 54.2 / 71.0 | 49.9 / 61.6 | ↓-11.7 |
| InternVL3.5-30B-A3B | 64.3 / 70.3 | 50.3 / 61.7 | 40.8 / 50.3 | 61.8 / 75.2 | 54.4 / 64.9 | ↓-10.5 |

Caption: Comparison of different VLMs on our benchmark. Results are reported under Visualized Text (VT) and Text inputs for each metric. The Gap column denotes the overall performance change when switching from Text to Visualized Text (↓ marks a drop, ↑ an improvement). All metrics are reported as percentages (%).

Contact

Citation

To be added.


License

To be added.
