STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

arXiv Project Page Paper

This repository contains the checkpoint for STVG-R1, the first reinforcement learning framework for spatial-temporal video grounding. We fine-tuned Qwen2.5-VL-7B-Instruct on the constructed Visual Prompt Data.


πŸ“‹ Key Features

  • Object-Centric Visual Prompt: A simple yet effective object-centric visual prompting paradigm reformulates dense per-frame coordinate prediction as a compact object-ID identification task.
  • Reinforcement Learning Training: STVG-R1 is trained entirely with reinforcement learning, enhancing its ability to produce accurate spatial-temporal visual grounding results.
  • Strong Zero-Shot Generalization: STVG-R1 exhibits strong zero-shot generalization to the multi-object referring video object segmentation task, despite being trained only on single-object grounding data.
  • SOTA Performance: STVG-R1 sets a new state of the art (SOTA) on the HCSTVG-v1, HCSTVG-v2, ST-Align, and MeViS benchmarks.
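To make the object-centric prompting idea above concrete, here is a minimal sketch of how an (object ID, temporal span) answer can be mapped back to per-frame boxes. This is our illustration, not code from the STVG-R1 repository; the tracker output format and `recover_boxes` helper are assumptions.

```python
# Hypothetical sketch: instead of predicting a box on every frame, the model
# names a single object ID plus a temporal span; per-frame boxes are then
# recovered from pre-computed track annotations (the visual prompts).

def recover_boxes(tracks, object_id, t_start, t_end):
    """Map a predicted (object ID, temporal span) back to per-frame boxes."""
    return {t: boxes[object_id]
            for t, boxes in tracks.items()
            if t_start <= t <= t_end and object_id in boxes}

# Toy track annotations: frame index -> {object ID: box (x1, y1, x2, y2)}
tracks = {
    0: {1: (10, 10, 50, 50), 2: (60, 60, 90, 90)},
    1: {1: (12, 11, 52, 51), 2: (61, 60, 91, 91)},
    2: {2: (62, 61, 92, 92)},
}

# Suppose the model answers "object 1, frames 0 to 1":
result = recover_boxes(tracks, object_id=1, t_start=0, t_end=1)
# -> {0: (10, 10, 50, 50), 1: (12, 11, 52, 51)}
```

The payoff is that the language model only has to emit a short, discrete answer, while dense localization stays with the visual prompts.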

πŸ’Ύ Model Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Model Size: 8B parameters
  • Precision: BF16
  • Training Data: HCSTVG-v1, HCSTVG-v2, and VidSTG
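The model is trained with reinforcement learning on grounding data; while the exact reward used by STVG-R1 is not given here, grounding rewards are commonly built from temporal overlap. Below is a minimal sketch of a temporal-IoU reward under that assumption; the function name and span format are ours, not the paper's.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal spans (start, end), in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: predicted span (2, 8) vs. ground truth (4, 10):
# intersection = 4, union = 8, so the reward is 0.5.
reward = temporal_iou((2.0, 8.0), (4.0, 10.0))
```

A reward of this shape is dense in the span endpoints, which gives the policy a smooth signal even when the predicted span only partially overlaps the ground truth.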

πŸ“ Citation

If you use STVG-R1 or this checkpoint in your research, please cite:
