PyVision: Agentic Vision with Dynamic Tooling.
We investigate Python code as visual primitives for image manipulation and reasoning. GPT-4.1 can generate code for image modifications, offering a novel approach to Visual Question Answering (VQA).
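To make the idea of Python code as a visual primitive concrete, the following is a minimal sketch of the kind of small image-manipulation helpers a model might write on the fly. It is an illustration only, not PyVision's actual tooling: the function names (`crop`, `stretch_contrast`, `rotate_90_clockwise`) and the representation of a grayscale image as nested lists are assumptions chosen to keep the example dependency-free.

```python
from typing import List

# Assumption for illustration: a grayscale image is a list of rows,
# each row a list of integer pixel values in the range 0-255.
Image = List[List[int]]


def crop(img: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the sub-image covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in img[top:bottom]]


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale pixels so the darkest maps to 0 and the brightest to 255.

    A simple stand-in for the enhancement step a low-light task might need.
    """
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in img]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*img[::-1])]


if __name__ == "__main__":
    dark = [
        [10, 12, 14],
        [16, 18, 20],
        [22, 24, 26],
    ]
    # Zoom into the top-right region, then enhance it for easier reading.
    region = crop(dark, left=1, top=0, right=3, bottom=2)
    print(stretch_contrast(region))
    print(rotate_90_clockwise([[1, 2], [3, 4]]))
```

Composing such helpers (zoom into a region, enhance it, re-orient it) mirrors, in a simplified form, how generated code can modify an image before answering a question about it.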
We collect a series of diverse VQA tasks, designed for agentic vision, including color recognition, low-light, instrument reading, jigsaw, math, maze, rotated OCR, proportion, rotation, spot the difference, symbolic reasoning, visual search, and word search.
We build PyVision-Image and PyVision-Video via reinforcement learning (RL), achieving state-of-the-art results on visual search, multimodal reasoning, agentic reasoning, and spatial reasoning tasks.