Color (100)
Q: What is the number in the center of this image? Select from the following choices.
Answer: D. 87
Low-Light (50)
Q: How many ducks are in the image?
Answer: 3
Instrument Reading (80)
Q: What is the stopwatch second reading in the image? Provide an integer.
Answer: 5
Jigsaw (120)
Q: Determine the correct arrangement to restore the original image.
Answer: 7,5,1,2,4,9,3,8,6
Math (120)
Q: Which arch is the longest?
A. AB  B. BC  C. CD  D. DE  E. EG  F. GA
Answer: E. EG
Maze (120)
Q: Please complete a maze game starting from the red ball to the green ball.
Answer: B. DDRRU...RRR
Rotated OCR (60)
Q: What is written in the image?
Answer: jump
Proportion (120)
Q: Which of the following values is the closest to the proportion of the image occupied by chair in blue bottom right corner?
Answer: F. 7%
Rotation (75)
Q: How many degrees should you rotate this image CLOCKWISE to restore it to its original orientation?
Answer: F. 330°
Spot the Difference (100)
Q: Output the list of patch indices where the two images differ.
Answer: 1,2,6,7,9,15,16,20
Symbolic Reasoning (50)
Q: How many edges are there in this polygon?
Answer: E. 16
TIR-Bench
A Comprehensive Benchmark for
Agentic Thinking-with-Images Reasoning
Total: 1215
Nov. 2025 Initialized

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

A broad benchmark suite for evaluating tool-using visual agents across diverse VQA tasks.

AUTHORS
Ming Li¹,*, Jike Zhong²,*, Shitian Zhao¹,*, Haoquan Zhang¹,⁴,*, Shaoheng Lin¹,*, Yuxiang Lai³,*, Chen Wei⁵, Konstantinos Psounis², Kaipeng Zhang¹,†
* Core Contributors; † Corresponding Author
AFFILIATIONS
¹ Shanghai AI Laboratory
² University of Southern California
³ Emory University
⁴ Chinese University of Hong Kong
⁵ Rice University

Leaderboard

We benchmark 22 MLLMs across open-source, proprietary, and tool-using settings. You can interactively rank models by metric and group below.

Table 2 task abbreviations: SR = Symbolic Reasoning, WS = Word Search, LL-VQA = Low-Light VQA, IR = Instrument Reading, SD = Spot the Difference, JG = Jigsaw Game, VS = Visual Search, RG = Rotation Game, Pro. = Proportion VQA.

BibTeX

Citation
@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Wei, Chen and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}