STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

arXiv Project Page Paper

This repository contains the checkpoint for STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding. We fine-tuned Qwen2.5-VL-7B-Instruct on the constructed Visual Prompt Data.


πŸ“‹ Key Features

  • Object-Centric Visual Prompt: A simple yet effective object-centric visual prompting paradigm reformulates dense per-frame coordinate prediction as a compact object-ID identification task.
  • Reinforcement Learning Training: STVG-R1 is trained entirely with reinforcement learning, enhancing its ability to produce accurate spatial-temporal visual grounding results.
  • Strong Zero-Shot Generalization: STVG-R1 exhibits strong zero-shot generalization to the multi-object referring video object segmentation task, despite being trained only on single-object grounding data.
  • SOTA Performance: STVG-R1 sets a new state of the art (SOTA) on the HCSTVG-v1, HCSTVG-v2, ST-Align, and MeViS benchmarks.
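To make the object-centric prompting idea above concrete, here is a minimal sketch of how an (object ID, temporal span) answer can be mapped back to per-frame boxes. This is our illustration, not code from the STVG-R1 repository; the tracker output format and `recover_boxes` helper are assumptions.

```python
# Hypothetical sketch: instead of predicting a box on every frame, the model
# names a single object ID plus a temporal span; per-frame boxes are then
# recovered from pre-computed track annotations (the visual prompts).

def recover_boxes(tracks, object_id, t_start, t_end):
    """Map a predicted (object ID, temporal span) back to per-frame boxes."""
    return {t: boxes[object_id]
            for t, boxes in tracks.items()
            if t_start <= t <= t_end and object_id in boxes}

# Toy track annotations: frame index -> {object ID: box (x1, y1, x2, y2)}
tracks = {
    0: {1: (10, 10, 50, 50), 2: (60, 60, 90, 90)},
    1: {1: (12, 11, 52, 51), 2: (61, 60, 91, 91)},
    2: {2: (62, 61, 92, 92)},
}

# Suppose the model answers "object 1, frames 0 to 1":
result = recover_boxes(tracks, object_id=1, t_start=0, t_end=1)
# -> {0: (10, 10, 50, 50), 1: (12, 11, 52, 51)}
```

The payoff is that the language model only has to emit a short, discrete answer, while dense localization stays with the visual prompts.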

πŸ’Ύ Model Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Model Size: 8B parameters
  • Precision: BF16
  • Training Data: HCSTVG-v1, HCSTVG-v2, and VidSTG
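The model is trained with reinforcement learning on grounding data; while the exact reward used by STVG-R1 is not given here, grounding rewards are commonly built from temporal overlap. Below is a minimal sketch of a temporal-IoU reward under that assumption; the function name and span format are ours, not the paper's.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal spans (start, end), in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: predicted span (2, 8) vs. ground truth (4, 10):
# intersection = 4, union = 8, so the reward is 0.5.
reward = temporal_iou((2.0, 8.0), (4.0, 10.0))
```

A reward of this shape is dense in the span endpoints, which gives the policy a smooth signal even when the predicted span only partially overlaps the ground truth.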

πŸ“ Citation

If you use STVG-R1 or this checkpoint in your research, please cite:
