Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
Here are the training checkpoints of Diff-V2M (AAAI'26).
## Overview
Diff-V2M is a hierarchical conditional diffusion model with explicit rhythmic modeling and multi-view feature conditioning, achieving state-of-the-art results in video-to-music generation.

## Model Sources
- Repository: https://github.com/Tayjsl97/Diff-V2M
- Demo: demo page
## Citation
If you use our models in your research, please cite them as follows:
```bibtex
@inproceedings{ji2026diff,
  title={Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation},
  author={Ji, Shulei and Wang, Zihao and Yu, Jiaxing and Yang, Xiangyuan and Li, Shuyu and Wu, Songruoyao and Zhang, Kejun},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```