Data Preparation for JavisDiT++
Stage1 - Audio Pretraining
In this stage, only audio files are needed to initialize the audio generation capability; the video fields are filled with placeholders:
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | audio_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| placeholder.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | placeholder | yyy |
Download the audio datasets (AudioCaps, VGGSound, AudioSet, WavCaps, Clotho, ESC50, MACS, UrbanSound8K, MusicInstrument, GTZAN, etc.) and put them in a single folder, /path/to/audios. Then run the following commands to generate train_audio.csv for the training configuration:
ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"
# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv
# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info
# 2.1 Trim audios to at most 30 seconds. This overwrites the raw audios by default and should output ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30
# 2.2 Unify the sample rate to 16 kHz for all audios. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000
# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video
# 3.2 Get training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
--data_root ${ROOT_AUDIO} \
--meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
--save_file ${ROOT_META}/train_audio.csv
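After steps 2.1 and 2.2, every audio file should be at most 30 seconds long and sampled at 16 kHz. The following is a minimal sketch for spot-checking a file with Python's standard `wave` module. It assumes the files are PCM WAV (other encodings need a different reader), and `wav_info` and `is_prepared` are hypothetical helpers, not part of the repository tools:

```python
import os
import tempfile
import wave


def wav_info(path: str) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def is_prepared(path: str, sr: int = 16000, max_seconds: float = 30.0) -> bool:
    """True if the file matches the targets of steps 2.1 and 2.2."""
    rate, seconds = wav_info(path)
    return rate == sr and seconds <= max_seconds


# Demo on a synthetic file: one second of 16-bit mono silence at 16 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    print(wav_info(demo_path))     # (16000, 1.0)
    print(is_prepared(demo_path))  # True
```

To check a whole dataset, the same function can be applied to each `audio_path` value in train_audio.csv.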
Stage2 - Audio-Video SFT
Here we provide an example with TAVGBench to prepare video-audio-text triplets for training. The same steps can be adapted to your own datasets.
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |
The following script will automatically generate a train_av_sft.csv for configuration:
ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/video"
fmin=10 # minimum number of frames for each video
fps=16 # for Wan2.1-1.3B
# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}
# 2.1 Unify FPS to 16 Hz for all videos. This will change the raw videos, and output ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps ${fps} --overwrite
# 3.1 Get training meta csv. This should output ${ROOT_META}/train_av_sft.csv
python -m tools.datasets.find_jav_ds tavgbench \
--meta_src /path/to/TAVGBench/release_captions.txt \
--meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv \
--save_file ${ROOT_META}/train_av_sft.csv
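Before training, it can help to sanity-check the generated CSV. In the example row of the table above, the derived columns are mutually consistent: resolution = height × width (480 × 640 = 307200) and aspect_ratio = height / width (480 / 640 = 0.75). The sketch below checks these two relations for a row. Both relations are inferred from the example row rather than confirmed by the tools, and `check_row` is a hypothetical helper:

```python
import csv
import io
import math


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one meta row (empty if none)."""
    problems = []
    height, width = int(row["height"]), int(row["width"])
    # Assumption: resolution is the pixel count of a single frame.
    if int(row["resolution"]) != height * width:
        problems.append("resolution != height * width")
    # Assumption: aspect_ratio is height divided by width (rounded in the CSV).
    if not math.isclose(float(row["aspect_ratio"]), height / width, abs_tol=0.01):
        problems.append("aspect_ratio != height / width")
    return problems


# Example row copied from the table above (only the columns used here).
sample = io.StringIO(
    "height,width,aspect_ratio,resolution\n"
    "480,640,0.75,307200\n"
)
for row in csv.DictReader(sample):
    print(check_row(row))  # []
```

For a real file, replace `sample` with `open("train_av_sft.csv", newline="")`.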
If you have multiple data sources, merge their CSV files into a single one:
python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
Stage3 - Audio-Video DPO
To run DPO, prepare a data pool that does not overlap with the SFT training data, and organize it into train_av_dpo_raw.csv:
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |
Then, run inference to generate N (e.g., 3) audio-video samples for each prompt input:
model_path="/path/to/av_sft_ckpt"
src_meta_path="/path/to/train_av_dpo_raw.csv"
save_dir="/path/to/dpo_gen"
resolution=480p # or 240p
num_frames=81 # 5s
aspect_ratio="9:16"
num_sample=3
cfg_file="configs/wan2.1/inference/sample.py"
torchrun --standalone --nproc_per_node 8 \
scripts/inference.py \
${cfg_file} \
--resolution ${resolution} --num-frames ${num_frames} --aspect-ratio ${aspect_ratio} \
--prompt-path ${src_meta_path} --model-path ${model_path} --num-sample ${num_sample} \
--save-dir ${save_dir} --verbose 1
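The `# 5s` comment on `num_frames` follows from the frame rate: assuming the 16 FPS used for Wan2.1-1.3B in Stage2, 81 frames last 81 / 16 ≈ 5.06 seconds. A small sketch of the conversion (`clip_seconds` is an illustrative helper):

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return num_frames / fps


print(clip_seconds(81, 16))   # 5.0625
print(clip_seconds(160, 16))  # 10.0
```

The second call corresponds to the 160-frame, 16 FPS example row in the tables above.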
Next, gather the 1 ground-truth audio-video pair and the 3 generated audio-video pairs for each prompt into a single meta file, train_av_dpo_gen.csv:
src_meta_path="/path/to/train_av_dpo_raw.csv"
data_dir="/path/to/dpo_gen"
gen_meta_path="/path/to/train_av_dpo_gen.csv"
# first, create a meta file and extract audios from videos
python -m tools.datasets.convert video ${data_dir} --output ${data_dir}/meta.csv
python -m tools.datasets.datautil ${data_dir}/meta.csv --info --fmin 1
python -m tools.datasets.datautil ${data_dir}/meta_info_fmin1.csv --extract-audio --audio-sr 16000
# second, gather the generated audio-video pairs
python -m tools.datasets.process_dpo \
--task gather_dpo_gen \
--src_meta_path ${src_meta_path} \
--tgt_meta_path ${data_dir}/meta_info_fmin1_au_sr16000.csv \
--out_meta_path ${gen_meta_path}
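After gathering, each prompt should have 1 + N candidates (4 when N = 3). The sketch below reports prompts that have a different count. It assumes the gathered file holds one row per candidate with the prompt in the `text` column; the actual layout written by `tools.datasets.process_dpo` may differ, and `incomplete_prompts` is a hypothetical helper:

```python
from collections import Counter


def incomplete_prompts(rows: list[dict], num_sample: int = 3, key: str = "text") -> dict:
    """Map each prompt whose candidate count is not 1 + num_sample to its count."""
    counts = Counter(row[key] for row in rows)
    expected = 1 + num_sample  # one ground-truth pair plus the generated pairs
    return {prompt: n for prompt, n in counts.items() if n != expected}


# Toy input: the first prompt is complete, the second is missing two candidates.
rows = [{"text": "a dog barks"}] * 4 + [{"text": "rain on a roof"}] * 2
print(incomplete_prompts(rows))  # {'rain on a roof': 2}
```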
Then, score the 1+3 candidates with the modality-aware reward. The results will be saved to ./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv:
gen_meta_path="/path/to/train_av_dpo_gen.csv"
res_dir="./evaluation_results/audio_video_dpo_reward"
METRICS="av-reward"
MAX_AUDIO_LEN_S=5.0
export CUDA_VISIBLE_DEVICES="0"
torchrun --nproc_per_node=1 -m eval.javisbench.main \
--input_file ${gen_meta_path} \
--output_file "${res_dir}/avdpo_gen.json" \
--max_audio_len_s ${MAX_AUDIO_LEN_S} \
--metrics ${METRICS}
Finally, rank the generated samples and select the chosen-rejected (win-lose) sample pairs for DPO training:
src_meta_path="./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv"
out_meta_path="./data/meta/avdpo/train_av_dpo.csv"
python -m tools.datasets.process_dpo \
--task rank_dpo_pair \
--src_meta_path ${src_meta_path} \
--out_meta_path ${out_meta_path}
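One common way to form a DPO pair is to take the highest-scoring candidate as chosen and the lowest-scoring one as rejected. The sketch below illustrates that rule only. It is an assumption, not necessarily the rule implemented in `tools.datasets.process_dpo`; the `reward` key, the file names, and the scores are made up for illustration:

```python
def pick_pair(candidates: list[dict], score_key: str = "reward") -> tuple[dict, dict]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c[score_key], reverse=True)
    return ranked[0], ranked[-1]


# Hypothetical scores for the candidates of one prompt.
candidates = [
    {"path": "gt.mp4", "reward": 0.91},
    {"path": "gen_0.mp4", "reward": 0.42},
    {"path": "gen_1.mp4", "reward": 0.77},
    {"path": "gen_2.mp4", "reward": 0.15},
]
chosen, rejected = pick_pair(candidates)
print(chosen["path"], rejected["path"])  # gt.mp4 gen_2.mp4
```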