Color (100)

Q: What is the number in the center of this image? Select from the following choices.

Answer: D. 87

Low-Light (50)

Q: How many ducks are in the image?

Answer: 3

Instrument Reading (80)

Q: What is the stopwatch second reading in the image? Provide an integer.

Answer: 5

Jigsaw (120)

Q: Determine the correct arrangement to restore the original image.

Answer: 7,5,1,2,4,9,3,8,6

Math (120)

Q: Which arc is the longest?
A. AB B. BC C. CD D. DE E. EG F. GA

Answer: E. EG

Maze (120)

Q: Please complete a maze game starting from the red ball to the green ball.

Answer: B. DDRRU...RRR

Rotated OCR (60)

Q: What is written in the image?

Answer: jump

Proportion (120)

Q: Which of the following values is the closest to the proportion of the image occupied by chair in blue bottom right corner?

Answer: F. 7%

Rotation (75)

Q: How many degrees should you rotate this image CLOCKWISE to restore it to its original orientation?

Answer: F. 330°
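
The arithmetic behind a clockwise restore angle can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function name is hypothetical, and reading the 330° answer as a 30° clockwise distortion is an inference, not something the page states.

```python
def clockwise_restore_angle(applied_clockwise_deg: int) -> int:
    """Clockwise rotation that undoes a clockwise rotation of the given size.

    Rotating clockwise by x and then by (360 - x) is a full turn,
    so the image returns to its original orientation.
    """
    return (360 - applied_clockwise_deg) % 360


# Assumed example: an image distorted by 30 degrees clockwise.
print(clockwise_restore_angle(30))
```

The modulo keeps an undistorted image (0°) at a restore angle of 0° rather than 360°.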

Spot the Difference (100)

Q: Output the list of patch indices where the two images differ.

Answer: 1,2,6,7,9,15,16,20
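
An answer of this form is a set of patch indices, so one natural way to compare a prediction against it is intersection over union (IoU). The page does not say how this task is scored, so the sketch below is an assumption for illustration only; the function names and the predicted answer are hypothetical.

```python
def parse_indices(answer: str) -> set[int]:
    """Parse a comma-separated list of patch indices into a set."""
    return {int(tok) for tok in answer.split(",") if tok.strip()}


def iou(pred: set[int], truth: set[int]) -> float:
    """Intersection over union of two index sets (1.0 if both are empty)."""
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)


truth = parse_indices("1,2,6,7,9,15,16,20")  # answer shown in the example
pred = parse_indices("1,2,6,7,9,15")         # hypothetical model output
print(iou(pred, truth))
```

Here the hypothetical prediction finds six of the eight differing patches and adds no wrong ones, so the score is 6/8.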

Symbolic Reasoning (50)

Q: How many edges are there in this polygon?

Answer: E. 16

Visual Search (120)

Q: Is a brown sheep present to the left of the central white horse?

Answer: B. No

Word Search (100)

Q: In the figure, in which row and column does the number 8 appear?

Answer: [17, 1]

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Total: 1215
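
As a quick consistency check, the numbers shown next to each task name in the gallery can be summed and compared with this total. The sketch assumes those numbers are per-task sample counts, which the page implies but does not state outright.

```python
# Per-task numbers as listed in the task gallery on this page
# (assumed to be sample counts).
TASK_COUNTS = {
    "Color": 100,
    "Low-Light": 50,
    "Instrument Reading": 80,
    "Jigsaw": 120,
    "Math": 120,
    "Maze": 120,
    "Rotated OCR": 60,
    "Proportion": 120,
    "Rotation": 75,
    "Spot the Difference": 100,
    "Symbolic Reasoning": 50,
    "Visual Search": 120,
    "Word Search": 100,
}

total = sum(TASK_COUNTS.values())
print(len(TASK_COUNTS), total)  # prints "13 1215"
```

The 13 listed tasks add up to 1215, matching the stated total.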

Nov. 2025 Initialized

A broad benchmark suite for evaluating tool-using visual agents across diverse VQA tasks.

AUTHORS

Ming Li¹,*, Jike Zhong²,*, Shitian Zhao¹,*, Haoquan Zhang¹,⁴,*, Shaoheng Lin¹,*, Yuxiang Lai³,*, Chen Wei⁵, Konstantinos Psounis², Kaipeng Zhang¹,†

* Core Contributors; † Corresponding Author

AFFILIATIONS

¹ Shanghai AI Laboratory

² University of Southern California

³ Emory University

⁴ Chinese University of Hong Kong

⁵ Rice University

Leaderboard

We benchmark 22 MLLMs across open-source, proprietary, and tool-using settings.

Table 2 task abbreviations: SR = Symbolic Reasoning, WS = Word Search, LL-VQA = Low-Light VQA, IR = Instrument Reading, SD = Spot Difference, JG = Jigsaw Game, VS = Visual Search, RG = Rotation Game, Pro. = Proportion VQA.

Citation

@article{li2025tir,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Wei, Chen and Psounis, Konstantinos and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}