STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
This repository contains the checkpoint for STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding. We fine-tuned Qwen2.5-VL-7B-Instruct on constructed Visual Prompt Data.
Key Features
- Object-Centric Visual Prompt: A simple yet effective object-centric visual prompting paradigm reformulates dense per-frame coordinate prediction into a compact object ID identification task.
- Reinforcement Learning Training: STVG-R1 is trained entirely with reinforcement learning, improving the accuracy of its spatial-temporal grounding predictions.
- Strong Zero-Shot Generalization: STVG-R1 generalizes zero-shot to the multi-object referring video object segmentation task, despite being trained only on single-object grounding data.
- SOTA Performance: STVG-R1 sets a new SOTA on the HCSTVG-v1, HCSTVG-v2, ST-Align and MeViS benchmarks.
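The exact prompt and answer formats used by STVG-R1 are not specified in this card. Purely as an illustration of the object-centric reformulation (coordinate regression replaced by object ID identification), here is a hypothetical sketch: the helper names, prompt wording, and the `<id>`/`<span>` answer tags below are assumptions, not the model's actual interface.

```python
import re


def build_object_id_prompt(query: str, num_objects: int) -> str:
    """Hypothetical prompt builder: candidate instances are assumed to be
    overlaid with numeric IDs in the frames, so the model only has to name
    the matching ID and a temporal span instead of per-frame boxes."""
    return (
        f"The video shows {num_objects} candidate objects, each marked "
        f"with a numeric ID. Which object matches this description: "
        f'"{query}"? Answer with the object ID and start/end frames, '
        f"e.g. <id>2</id><span>10,45</span>."
    )


def parse_grounding_answer(text: str):
    """Parse the assumed answer format into (object_id, (start, end)),
    or None if the text does not match."""
    obj = re.search(r"<id>(\d+)</id>", text)
    span = re.search(r"<span>(\d+)\s*,\s*(\d+)</span>", text)
    if not obj or not span:
        return None
    return int(obj.group(1)), (int(span.group(1)), int(span.group(2)))
```

The compact ID-plus-span output is what makes RL training tractable here: a scalar reward can be computed by comparing the parsed ID and span against ground truth, rather than scoring dense per-frame coordinates.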
Model Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Training Data: HCSTVG-v1, HCSTVG-v2 and VidSTG
Citation
If you use STVG-R1 or this checkpoint in your research, please cite: