PyVision-RL: Forging Open Agentic Vision Models via RL.
We build PyVision-Image and PyVision-Video via RL, achieving state-of-the-art results on visual search, multi-modal reasoning, agentic reasoning, and spatial reasoning tasks.
TIR-Bench: A Comprehensive Benchmark for Agentic Vision.
After PyVision, we revisit the fundamental question of which kinds of problems truly require agentic vision capabilities. We introduce TIR-Bench, the first comprehensive agentic vision benchmark, consisting of 1,215 questions that cover 13 different tasks.
PyVision: Agentic Vision with Dynamic Tooling.
We explore using Python code as visual primitives for image manipulation and reasoning. We develop PyVision, a framework that enables agentic vision with dynamic tooling. Your MLLM already possesses agentic vision capabilities!
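To make "Python code as visual primitives" concrete, here is a minimal sketch of the general idea, not PyVision's actual implementation. It assumes the model emits a snippet of Python, and a small runtime executes that snippet against the current image. The helper `run_generated_tool`, the convention of storing the output in `result`, and the use of a nested list as a stand-in for an image are all hypothetical choices made for illustration.

```python
# Hypothetical sketch: the image is a plain 2D list of grayscale values,
# standing in for a real image array.
def run_generated_tool(code: str, image):
    """Execute model-written code with the image bound to `image`.

    The code is expected to assign its output to a variable named `result`
    (an assumed convention for this sketch).
    """
    namespace = {"image": image}
    exec(code, namespace)  # no sandboxing here; a real system would isolate this
    return namespace["result"]


# Example of code a model might write on the fly to zoom into a region:
# it defines its own crop tool and applies it to rows 1-2, columns 1-2.
generated_code = """
def crop(img, top, left, height, width):
    return [row[left:left + width] for row in img[top:top + height]]

result = crop(image, 1, 1, 2, 2)
"""

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

patch = run_generated_tool(generated_code, image)
print(patch)  # [[5, 6], [9, 10]]
```

The point of the sketch is that the tool (`crop`) is not predefined by the runtime: it arrives as code, is executed once, and its output can be fed back to the model for the next reasoning step.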