Visual grounding on videos

#13
by iariav - opened

Hi,
First of all, amazing work! The new Qwen3-VL models are awesome.

I use them for visual grounding (bounding boxes) in videos. Since bbox detection is currently only supported for images, I run it frame by frame at 1 fps, but that takes a very long time.
From your experience, is there a better way to get accurate, consistent bbox detections from videos?
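
For context, the 1 fps frame sampling mentioned above can be sketched roughly like this. It is only an illustration, not an exact pipeline: `sample_frame_indices` is a hypothetical helper name, the video's frame count and native frame rate are assumed to be known already, and the per-frame grounding call to the model is left out entirely.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> list[int]:
    """Return the indices of the frames to keep when sampling at target_fps.

    Hypothetical helper for illustration; it only does index arithmetic
    and does not read any video file.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Number of native frames between two sampled frames (never below 1,
    # so we cannot sample faster than the video's own frame rate).
    step = max(native_fps / target_fps, 1.0)
    indices = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# A 3-second clip at 30 fps sampled at 1 fps keeps one frame per second.
print(sample_frame_indices(90, 30.0))
```

Each selected frame is then sent to the model as a separate image, which is why the cost grows linearly with video length.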
