Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
Here are the training checkpoints of Diff-V2M (AAAI'26).
## Overview
Diff-V2M is a hierarchical conditional diffusion model with explicit rhythmic modeling and multi-view feature conditioning, achieving state-of-the-art results in video-to-music generation.

## Model Sources
- Repository: https://github.com/Tayjsl97/Diff-V2M
- Demo: demo page
## Citation
If you use our models in your research, please cite them as follows:
```bibtex
@inproceedings{ji2026diff,
  title={Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation},
  author={Ji, Shulei and Wang, Zihao and Yu, Jiaxing and Yang, Xiangyuan and Li, Shuyu and Wu, Songruoyao and Zhang, Kejun},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```