Abstract
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
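Below is a minimal, illustrative sketch of the core idea described in the abstract: a small set of learnable latent tokens that attend globally over image patch features and produce a task-adaptive re-encoding of the image, trained only with the standard answer loss. This is not the authors' implementation; the module name, token count, single-block design, and dimensions are assumptions for illustration.

```python
# Hypothetical sketch of learnable "visual reasoning tokens" (not the paper's code).
import torch
import torch.nn as nn

class LatentVisualTokens(nn.Module):
    def __init__(self, num_latents: int = 8, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Learnable latent tokens, discovered end-to-end without per-token supervision.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Cross-attention: latents (queries) attend globally over image patches (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim) from the vision encoder.
        b = patch_embeds.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)      # (batch, num_latents, dim)
        attended, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        x = self.norm(q + attended)
        return x + self.ffn(x)                                # task-adaptive visual tokens

# Usage sketch: the resulting tokens would be interleaved with the text embeddings
# fed to the language model and optimized only through the final answer loss.
patches = torch.randn(2, 196, 768)                            # e.g. a 14x14 patch grid
visual_reasoning_tokens = LatentVisualTokens()(patches)       # (2, 8, 768)
```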
Community
TL;DR: We introduce a new method that improves visual reasoning by allowing models to implicitly learn latent visual representations, without requiring explicit supervision or additional data for these latents.
arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/latent-implicit-visual-reasoning-9998-b5005eaa
- Executive Summary
- Detailed Breakdown
- Practical Applications
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/latent-implicit-visual-reasoning
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Mull-Tokens: Modality-Agnostic Latent Thinking (2025)
- Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs (2025)
- Interleaved Latent Visual Reasoning with Selective Perceptual Modeling (2025)
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models (2025)
- Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens (2025)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards (2025)
- CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning (2025)