Released February 2026

PyVision-RL: Forging Open Agentic Vision Models via RL

AUTHORS
‡ Project Lead;  * Core Contributor;  † Corresponding Author
AFFILIATIONS
1Shanghai AI Lab,  2Rice University,  3CUHK,  4UMD,  5THU,  6Shanda AI Research, Tokyo

Introduction

Large language models (LLMs) have rapidly evolved from passive chatbots into actionable agents capable of multi-turn interaction and tool use. Beyond proprietary systems, a growing body of research has explored how to endow open-weight models with tool-using capabilities, particularly for tasks such as deep research and computer use that demand sustained interaction with external environments.

More recently, this agentic paradigm has extended from purely textual domains to multimodal reasoning. Works such as OpenAI o3 demonstrate that incorporating tool use into visual understanding can ground multimodal reasoning in task-relevant visual evidence, enabling models to actively manipulate visual inputs rather than passively process them. This motivates the development of multimodal agents that reason, act, and interact over images and videos.

Existing approaches to multimodal tool use largely follow two design paradigms. One line of work relies on static toolsets, where a fixed set of task-specific tools, such as cropping, zooming, or video clipping, is manually predefined and exposed to the model. While effective for specific tasks, these approaches lack flexibility and require task-dependent engineering. An alternative paradigm, dynamic tooling, treats Python as a primitive tool, allowing the model to synthesize task-specific operations on the fly. This approach enables expressive and compositional tool use, but has so far remained largely limited to image understanding and often relies on proprietary APIs, leaving open-weight multimodal reinforcement learning (RL) underexplored, especially for video.

A key challenge in training such agentic multimodal models lies in training stability and avoiding interaction collapse. Prior work observes that after RL fine-tuning, models tend to reduce tool usage, converging to short, low-interaction behaviors. This has led to skepticism about the effectiveness of test-time interaction scaling for agentic visual understanding, in contrast to its success in textual reasoning. We argue that this limitation does not reflect an inherent weakness of interaction, but rather insufficient training incentives and unstable rollout selection during RL.

In this paper, we present PyVision-RL, an agentic training framework for open-weight multimodal models that addresses these challenges. We adopt Python as a primitive tool to enable dynamic tooling for both image and video understanding, and apply reinforcement learning with two key innovations: (1) an oversampling–filtering–ranking framework for rollout generation that stabilizes agent–environment interaction, and (2) an accumulative tool reward that explicitly incentivizes sustained multi-turn tool usage. Using a unified training pipeline, we introduce two models: PyVision-Image for image understanding and PyVision-Video for video understanding. Notably, PyVision-Video employs on-demand context construction, where the full video is loaded only into the Python runtime, and the model selectively samples and plots task-relevant frames via Python code during reasoning. This agentic frame fetching strategy avoids uniform frame sampling, substantially reducing visual token consumption while improving reasoning efficiency.
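As a rough illustration of these two ideas, the sketch below shows how an accumulative tool reward and an oversampling–filtering–ranking step could fit together. All function names, thresholds, and the rollout schema are our own illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: oversample rollouts, filter degenerate (collapsed)
# ones, rank the rest, and keep the top-k for the RL update.
# Names and thresholds are illustrative assumptions, not the paper's method.

def accumulative_tool_reward(num_tool_calls, per_call_bonus=0.1, cap=0.5):
    """Reward that grows with each tool call, up to a cap, so the
    policy is incentivized to sustain multi-turn tool use."""
    return min(num_tool_calls * per_call_bonus, cap)

def select_rollouts(rollouts, k=4, min_turns=1):
    """Oversampling-filtering-ranking: drop rollouts with no interaction,
    then rank the rest by task reward plus the accumulative tool bonus."""
    # Filter: discard low-interaction (collapsed) rollouts.
    kept = [r for r in rollouts if r["turns"] >= min_turns]
    # Rank: combine task correctness with the tool-use bonus.
    kept.sort(
        key=lambda r: r["task_reward"] + accumulative_tool_reward(r["turns"]),
        reverse=True,
    )
    return kept[:k]

rollouts = [
    {"id": 0, "turns": 0, "task_reward": 1.0},  # correct but no tool use
    {"id": 1, "turns": 3, "task_reward": 1.0},
    {"id": 2, "turns": 2, "task_reward": 0.0},
    {"id": 3, "turns": 5, "task_reward": 1.0},
]
best = select_rollouts(rollouts, k=2)
print([r["id"] for r in best])  # → [3, 1]
```

Note how the zero-turn rollout is filtered out even though it is correct: without such filtering, short low-interaction trajectories would dominate the update and drive the interaction collapse described above.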

Our models achieve strong empirical results. PyVision-Image attains state-of-the-art performance on visual search, multimodal reasoning, and agentic reasoning benchmarks, outperforming prior methods such as DeepEyes-v2 by +6.9% on V* and +9.6% on WeMath. PyVision-Video surpasses VITAL, a multimodal agent with a video clipping tool, by +2.2% on VSI-Bench, while using significantly fewer visual tokens. Enabled by on-demand context construction, PyVision-Video achieves a favorable performance–efficiency trade-off, using on average 5K visual tokens per sample compared to 45K for Qwen2.5-VL-7B, yet attaining higher accuracy: 44.0% for PyVision-Video versus 38.0% for Qwen2.5-VL-7B.

In summary, we introduce PyVision-RL, a unified agentic reinforcement learning framework for open-weight multimodal models that enables tool-based reasoning over both images and videos. By combining an oversampling–filtering–ranking rollout strategy with an accumulative tool reward, our approach prevents interaction collapse and effectively incentivizes multi-turn agent behavior. The resulting models, PyVision-Image and PyVision-Video, demonstrate that sustained interaction and tool use remain powerful mechanisms for multimodal reasoning when trained with appropriate incentives, achieving state-of-the-art performance while substantially improving token efficiency, particularly for video understanding.

Agent Scaffold

PyVision Method

We design two agentic scaffolds for image and video understanding under a unified framework of dynamic tooling with Python. For PyVision-Image, both the system prompt and image hints are injected into the MLLM context, and the images are also loaded into the Python runtime. For PyVision-Video, only the system prompt is injected into the MLLM context, while the video is loaded exclusively into the runtime environment. Given a query, the model interleaves reasoning with executable code blocks (code_block_0) to process multimodal inputs. Execution results (mm_clue_0), including textual outputs and rendered images, are appended to the context and fed back to the model. This interaction loop repeats until a final answer is produced. By confining video inputs to the runtime, PyVision-Video enables on-demand context construction, where the agent selectively samples and plots task-relevant frames during reasoning, substantially improving visual token efficiency.
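The on-demand context construction above can be sketched in miniature. In this toy version, the "video" is just a list of frame labels held in the runtime, and the two fetch calls stand in for agent-issued code across two turns; none of these class or method names come from the paper's actual implementation.

```python
# Illustrative sketch of on-demand context construction: the full "video"
# lives only in the Python runtime, and the agent fetches task-relevant
# frames by index instead of uniformly loading everything into context.
# All names here are hypothetical, not the paper's actual API.

class VideoRuntime:
    """Stands in for the Python runtime holding the full video."""
    def __init__(self, frames):
        self.frames = frames   # full video, never placed in the MLLM context
        self.fetched = []      # indices the agent chose to inspect

    def fetch(self, indices):
        """Agent-issued code selects and 'plots' specific frames;
        the returned clues are appended to the context as mm_clue_*."""
        self.fetched.extend(indices)
        return [self.frames[i] for i in indices]

video = [f"frame_{i}" for i in range(1000)]   # a 1000-frame video
rt = VideoRuntime(video)

# Turn 1: coarse scan across the whole video.
rt.fetch([0, 250, 500, 750, 999])
# Turn 2: zoom in around the region the model found relevant.
clues = rt.fetch([740, 745, 755, 760])

# Only 9 frames ever enter the context, versus 1000 under uniform loading.
print(len(rt.fetched))  # → 9
```

The coarse-then-fine fetch pattern is one plausible way an agent amortizes its turn budget; the key property is that context cost scales with the frames actually inspected, not with video length.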

Results

Cases

All examples were generated by PyVision-Image and PyVision-Video. Some cases contain mistakes; we share them mainly to illustrate the interesting reasoning trajectories.

Conclusion

We present PyVision-RL, a unified agentic multimodal framework for image and video understanding that adopts Python for dynamic tooling. To stabilize tool-use RL, we introduce an oversampling–filtering–ranking framework for rollout generation, and show that increasing the maximum turn budget leads to a higher performance ceiling. Empirically, PyVision-Image achieves strong performance across benchmarks, outperforming prior agentic MLLMs. PyVision-Video shows effective spatial reasoning while substantially reducing visual token usage, achieving a favorable accuracy–efficiency trade-off on VSI-Bench. Together, these results highlight the effectiveness of dynamic tooling and sustained interaction for multimodal agentic reasoning.

Bibtex

@article{zhao2026pyvisionrl,
  title={PyVision-RL: Forging Open Agentic Vision Models via RL},
  author={Zhao, Shitian and Lin, Shaoheng and Li, Ming and Zhang, Haoquan and Peng, Wenshuo and Zhang, Kaipeng and Wei, Chen},
  journal={arXiv preprint arXiv:2602.20739},
  year={2026},
}