Leaderboard
We benchmark 22 MLLMs across open-source, proprietary, and tool-using settings. You can interactively rank models by metric and group below.
Table 2 task abbreviations: SR = Symbolic Reasoning, WS = Word Search, LL-VQA = Low-Light VQA, IR = Instrument Reasoning, SD = Spot Difference, JG = Jigsaw Game, VS = Visual Search, RG = Rotation Game, Pro. = Proportion VQA.
Rank
Model
Group
Score
BibTeX
Citation
@article{li2025tir,
title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
author={Li, Ming and Zhong, Jike and Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Lai, Yuxiang and Wei, Chen and Psounis, Konstantinos and Zhang, Kaipeng},
journal={arXiv preprint arXiv:2511.01833},
year={2025}
}












