Do VLMs really understand visualized text as well as pure text?
A systematic benchmark spanning multimodal perception → reasoning → unimodal knowledge.
Figure 1: Motivation of VISTA-Bench.
We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding. It evaluates visualized text understanding by contrasting pure-text and visualized-text (VT) questions under controlled rendering conditions.
- Size: 1,500 instances
- Composition: predominantly multiple-choice questions (MCQ), with a small portion of open-ended queries
- Task taxonomy (4 primary categories; summarized in the snippet below):
  - Unimodal Knowledge: 500
  - Multimodal Knowledge: 400
  - Multimodal Perception: 300
  - Multimodal Reasoning: 300
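A minimal Python sketch of the composition above (the dictionary keys are illustrative labels, not an official field schema):

```python
# Category sizes of VISTA-Bench, taken from the taxonomy list above.
# The key names are illustrative labels, not an official schema.
VISTA_BENCH_TAXONOMY = {
    "Unimodal Knowledge": 500,
    "Multimodal Knowledge": 400,
    "Multimodal Perception": 300,
    "Multimodal Reasoning": 300,
}
assert sum(VISTA_BENCH_TAXONOMY.values()) == 1500  # matches the 1,500-instance total
```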
Figure 2: Ability dimensions and task taxonomy of VISTA-Bench.
Figure 3: A representative multimodal perception case under the visualized-text interface.
VISTA-Bench/
├─ assets/figures/ # figures used in this README
├─ images/ # original images (for multimodal instances)
├─ questions/ # rendered question/option images (VT setting)
├─ VLMEvalKit/ # evaluation toolkit
├─ VISTA-Bench.tsv # dataset index (currently identical to the VT variant)
└─ VISTA-Bench-VT.tsv # dataset index (currently identical; kept for compatibility)
Note: `VISTA-Bench.tsv` and `VISTA-Bench-VT.tsv` are currently identical; we keep both filenames for compatibility and will refine the organization later.
- `images/`: original images used by multimodal instances
- `questions/`: rendered question/option images for the visualized-text (VT) setting
- `*.tsv`: dataset metadata and file paths used for evaluation (a quick loading sketch is shown below)
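A quick way to sanity-check the index is to load it with pandas. This is only a sketch: the actual column set is whatever the released TSV contains, and you may need an explicit `encoding=` if the raw file is not UTF-8 (the conversion step below re-encodes it):

```python
import pandas as pd

# Peek at the dataset index. Adjust the path to your local checkout; the
# column names are defined by the released TSV, not by this snippet.
df = pd.read_csv("VISTA-Bench/VISTA-Bench.tsv", sep="\t")
print(len(df))            # expected: 1,500 instances
print(list(df.columns))   # question/option fields and image-path fields
print(df.head(3))
```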
We evaluate VISTA-Bench with the bundled `VLMEvalKit/`.
Before running evaluation, we recommend converting the TSV file(s) into a normalized format with absolute image paths.
Helper script:
VISTA-Bench/VLMEvalKit/utils/convert_data_file.py
What it does (roughly sketched in Python after this list):
- converts the TSV encoding to UTF-8
- normalizes path separators (`\` → `/`)
- renames `options_A/B/C/D` → `A/B/C/D` when needed
- converts `image_path` and `question_image_path` into a bracket-style multi-path string
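The snippet below is only a rough Python sketch of these steps, not the helper itself: the exact bracket-style format and the name of the merged column (`image`) are assumptions here, so treat `convert_data_file.py` as the authoritative reference.

```python
import pandas as pd

# Rough sketch of what convert_data_file.py does -- NOT the helper itself.
# The bracket-style multi-path format (a Python-list-like string) and the
# merged column name `image` are assumptions; consult the script directly.
IMAGE_PREFIX = "/ABS/PATH/TO/VISTA-Bench"  # plays the role of --image-prefix

df = pd.read_csv("VISTA-Bench/VISTA-Bench.tsv", sep="\t")  # add encoding=... if needed

# Normalize separators and resolve relative paths against the dataset root.
for col in ("image_path", "question_image_path"):
    if col in df.columns:
        df[col] = (
            df[col].astype(str).str.replace("\\", "/", regex=False)
            .apply(lambda p: p if p.startswith("/") else f"{IMAGE_PREFIX}/{p}")
        )

# Rename option columns when needed (options_A/B/C/D -> A/B/C/D).
df = df.rename(columns={f"options_{c}": c for c in "ABCD"})

# Combine the two path columns into one bracket-style multi-path string.
if {"image_path", "question_image_path"} <= set(df.columns):
    df["image"] = df.apply(
        lambda r: str([r["image_path"], r["question_image_path"]]), axis=1
    )

df.to_csv("VISTA-Bench_norm_sketch.tsv", sep="\t", index=False, encoding="utf-8")
```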
Example:

```bash
python VISTA-Bench/VLMEvalKit/utils/convert_data_file.py \
  --in VISTA-Bench/VISTA-Bench.tsv \
  --out VISTA-Bench/VISTA-Bench_norm.tsv \
  --image-prefix /ABS/PATH/TO/VISTA-Bench
```
- `--in`: input TSV path (the original dataset TSV to be converted)
- `--out`: output TSV path (the converted/normalized TSV produced by this script)
- `--image-prefix`: the dataset root directory containing `images/`, `questions/`, and the TSV files (used to resolve relative paths)
After conversion, rename the TSV column header `image` to `image-1` to avoid an `AssertionError` in some VLMEvalKit setups.
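If you prefer to do the rename in code, a minimal pandas sketch (assuming the converted file from the example above) is:

```python
import pandas as pd

# Rename the `image` column to `image-1` in the converted TSV.
# The path below is the output of the conversion example; adjust as needed.
path = "VISTA-Bench/VISTA-Bench_norm.tsv"
df = pd.read_csv(path, sep="\t")
df = df.rename(columns={"image": "image-1"})
df.to_csv(path, sep="\t", index=False)
```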
Pure-text:

```bash
python /VISTA-Bench/VLMEvalKit/run.py \
  --data VISTA-Bench_norm \
  --model llava_v1.5_7b \
  --verbose
```
Visualized-text (VT):

```bash
python /VISTA-Bench/VLMEvalKit/run.py \
  --data VISTA-Bench-VT \
  --model llava_v1.5_7b \
  --verbose
```
- `--data`: the dataset name corresponding to your converted TSV (e.g., `VISTA-Bench_norm`). The VT split (`VISTA-Bench-VT`) should also be converted and can then be run directly under this name because it is registered in `VLMEvalKit/vlmeval/dataset/image_mm_mcq.py` via `DATASET_URL`/`MD5`.
- `--model`: the model name defined in `VLMEvalKit/vlmeval/config.py` (make sure the corresponding weights are available in your environment; see the registry-listing sketch below).
- `--verbose`: print detailed logs during evaluation.
- Outputs: the final report includes `overall` results and an `l1-categories` breakdown.
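Before launching a run, you can list the registered model names. The snippet below assumes the bundled VLMEvalKit is importable (e.g., installed with `pip install -e VISTA-Bench/VLMEvalKit`) and follows the usual VLMEvalKit convention of exposing a `supported_VLM` registry in `vlmeval/config.py`:

```python
# List model names registered in VLMEvalKit before calling run.py.
# Assumes the bundled VLMEvalKit is installed and exposes the usual
# `supported_VLM` registry in vlmeval/config.py.
from vlmeval.config import supported_VLM

print(len(supported_VLM), "models registered")
print([name for name in supported_VLM if "llava" in name.lower()][:10])
```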
**Modality Comparison (VT vs. Text)**

| Model | Multimodal Perception (VT) | Multimodal Perception (Text) | Multimodal Reasoning (VT) | Multimodal Reasoning (Text) | Multimodal Knowledge (VT) | Multimodal Knowledge (Text) | Unimodal Knowledge (VT) | Unimodal Knowledge (Text) | Overall (VT) | Overall (Text) | ↓ Gap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ▼ Vision-Language Models (2B) | |||||||||||
| DeepSeek-VL2-Tiny | 44.3 | 64.0 | 31.3 | 43.7 | 27.5 | 28.8 | 27.6 | 41.8 | 31.7 | 43.1 | ↓-11.4 |
| Qwen3-VL-2B-Instruct | 51.3 | 69.0 | 32.7 | 49.7 | 17.5 | 24.0 | 37.2 | 52.0 | 33.9 | 47.5 | ↓-13.6 |
| Ovis2-2B | 58.3 | 66.7 | 39.7 | 52.3 | 27.0 | 31.2 | 36.0 | 50.2 | 38.8 | 48.9 | ↓-10.1 |
| NEO-2B-SFT | 40.0 | 68.3 | 31.3 | 49.3 | 25.3 | 37.3 | 29.4 | 53.4 | 30.8 | 51.3 | ↓-20.5 |
| Qwen2.5-VL-3B-Instruct | 65.0 | 67.7 | 43.3 | 54.3 | 32.5 | 35.8 | 54.8 | 56.6 | 48.6 | 52.8 | ↓-4.2 |
| InternVL3.5-2B | 56.0 | 66.3 | 39.3 | 50.3 | 30.5 | 39.8 | 44.8 | 57.0 | 42.1 | 52.9 | ↓-10.8 |
| SAIL-VL2-2B | 65.3 | 69.7 | 47.3 | 57.7 | 32.2 | 39.0 | 43.6 | 54.8 | 45.7 | 54.1 | ↓-8.4 |
| Ovis2.5-2B | 66.3 | 69.7 | 51.7 | 58.0 | 28.5 | 39.5 | 51.8 | 60.0 | 48.5 | 56.1 | ↓-7.6 |
| ▼ Vision-Language Models (8B) | |||||||||||
| LLaVA-1.5-7B | 33.0 | 58.7 | 27.3 | 44.3 | 26.5 | 27.3 | 24.0 | 48.8 | 27.1 | 44.1 | ↓-17.0 |
| LLaVA-OneVision-7B | 40.3 | 66.0 | 27.0 | 56.3 | 20.0 | 35.3 | 27.0 | 58.6 | 27.8 | 53.4 | ↓-25.6 |
| Qwen2.5-VL-7B-Instruct | 65.7 | 65.3 | 52.7 | 53.0 | 27.0 | 35.3 | 62.4 | 62.0 | 51.7 | 53.7 | ↓-2.0 |
| MiniCPM-V-4\_5 | 64.3 | 71.6 | 45.7 | 60.3 | 31.5 | 36.0 | 50.4 | 55.0 | 47.2 | 54.3 | ↓-7.1 |
| Qwen3-VL-8B-Instruct | 65.3 | 67.3 | 49.0 | 49.3 | 37.5 | 44.5 | 57.8 | 68.2 | 52.1 | 57.9 | ↓-5.8 |
| Ovis2-8B | 66.7 | 71.0 | 47.7 | 60.0 | 29.0 | 39.8 | 50.8 | 65.4 | 47.5 | 58.6 | ↓-11.1 |
| InternVL3.5-8B | 61.3 | 64.3 | 45.7 | 52.3 | 35.3 | 44.5 | 57.6 | 71.2 | 50.0 | 58.9 | ↓-8.9 |
| MiMo-VL-7B-RL | 70.3 | 69.0 | 58.3 | 61.3 | 40.5 | 38.3 | 69.0 | 68.8 | 59.5 | 59.2 | ↑+0.3 |
| NEO-9B-SFT | 32.7 | 69.0 | 29.0 | 58.0 | 24.8 | 40.5 | 28.6 | 69.2 | 28.5 | 59.3 | ↓-30.8 |
| LLaVA-OneVision-1.5-8B | 62.7 | 68.7 | 46.3 | 59.0 | 33.5 | 42.5 | 57.4 | 67.8 | 49.9 | 59.5 | ↓-9.6 |
| MiMo-VL-7B-SFT | 68.3 | 69.3 | 61.3 | 62.7 | 40.5 | 41.0 | 70.6 | 72.0 | 60.3 | 61.3 | ↓-1.0 |
| SAIL-VL2-8B | 68.7 | 70.0 | 54.3 | 60.7 | 37.8 | 44.8 | 58.0 | 71.0 | 54.0 | 61.7 | ↓-7.7 |
| Ovis2.5-9B | 68.3 | 69.0 | 56.3 | 65.3 | 38.8 | 52.0 | 66.0 | 73.8 | 57.3 | 65.3 | ↓-8.0 |
| GLM-4.1V-9B-Thinking | 70.7 | 71.3 | 58.7 | 61.7 | 50.0 | 52.5 | 73.8 | 75.8 | 63.8 | 65.9 | ↓-2.1 |
| ▼ Vision-Language Models (30B-A3B) | |||||||||||
| Kimi-VL-A3B-Thinking | 69.0 | 71.0 | 49.7 | 61.7 | 30.8 | 33.5 | 52.6 | 66.4 | 49.4 | 57.6 | ↓-8.2 |
| Qwen3-VL-30B-A3B-Instruct | 64.3 | 71.0 | 51.0 | 60.0 | 33.0 | 44.0 | 54.2 | 71.0 | 49.9 | 61.6 | ↓-11.7 |
| InternVL3.5-30B-A3B | 64.3 | 70.3 | 50.3 | 61.7 | 40.8 | 50.3 | 61.8 | 75.2 | 54.4 | 64.9 | ↓-10.5 |
Caption: Comparison of different VLMs on our benchmark. Results are reported under Visualized Text (VT) and Text inputs for each metric. The best result per column is bolded. The ↓Gap column denotes the overall performance drop when switching from Text to Visualized Text. All metrics are reported as percentages (%).
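As a reading aid, the ↓Gap column is simply Overall(VT) minus Overall(Text); the snippet below reproduces two entries using values copied from the rows above:

```python
# Gap = Overall(VT) - Overall(Text): negative values mean the model loses
# accuracy when the same question is rendered as an image.
rows = {
    "DeepSeek-VL2-Tiny": (31.7, 43.1),  # (Overall VT, Overall Text)
    "MiMo-VL-7B-RL": (59.5, 59.2),
}
for model, (vt, text) in rows.items():
    print(f"{model}: gap = {vt - text:+.1f}")
# DeepSeek-VL2-Tiny: gap = -11.4
# MiMo-VL-7B-RL: gap = +0.3
```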
- Qing'an Liu: 2223884741@mail.dlut.edu.cn
- Juntong Feng: 2253762636@mail.dlut.edu.cn
To be added.
To be added.