Data Preparation for JavisDiT++

Stage1 - Audio Pretraining

In this stage, only audio files are needed to initialize the audio generation capability. The meta file uses the following format, with placeholder values in the video fields:

path	id	relpath	num_frames	height	width	aspect_ratio	fps	resolution	audio_path	audio_fps	text	audio_text
placeholder.mp4	xxx	xxx.mp4	240	480	640	0.75	24	307200	/path/to/xxx.wav	16000	placeholder	yyy
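
The sample row suggests that aspect_ratio and resolution are derived from height and width (480 / 640 = 0.75 and 480 x 640 = 307200). The following is a minimal sketch of that relation, assuming it holds for every row. `derived_fields` is a hypothetical helper for illustration, not part of the repository:

```python
def derived_fields(height: int, width: int) -> dict:
    """Recompute the derived meta columns from a clip's height and width."""
    return {
        # Assumption inferred from the sample row: height divided by width.
        "aspect_ratio": height / width,
        # Assumption inferred from the sample row: total pixel count.
        "resolution": height * width,
    }


print(derived_fields(480, 640))
```

This can serve as a quick sanity check when writing a meta file by hand.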

Download the audio datasets (including AudioCaps, VGGSound, AudioSet, WavCaps, Clotho, ESC50, MACS, UrbanSound8K, MusicInstrument, GTZAN, etc.) and put them into the same folder, /path/to/audios. Then run the following commands to automatically generate a train_audio.csv for configuration:

ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"

# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv

# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info

# 2.1 Trim audios to at most 30 seconds. This overwrites the raw audios by default and should output ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30

# 2.2 Unify the sample rate to 16 kHz for all audios. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000

# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video

# 3.2 Get training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
    --data_root ${ROOT_AUDIO} \
    --meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
    --save_file ${ROOT_META}/train_audio.csv
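
Before training, it can help to confirm that the generated train_audio.csv has the expected columns and that every audio was resampled to 16 kHz. The sketch below is one way to do that. It assumes a comma-delimited file whose header matches the Stage1 example above. `check_audio_meta` is a hypothetical helper, not a tool shipped with the repository:

```python
import csv

# Column names taken from the Stage1 example header.
REQUIRED_COLUMNS = [
    "path", "id", "relpath", "num_frames", "height", "width", "aspect_ratio",
    "fps", "resolution", "audio_path", "audio_fps", "text", "audio_text",
]


def check_audio_meta(csv_path, expected_sr=16000):
    """Return a list of human-readable problems found in the meta file."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # assumption: comma-delimited with a header row
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {missing}"]
        for line_no, row in enumerate(reader, start=2):  # the header is line 1
            try:
                sr = int(float(row["audio_fps"]))
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: audio_fps is not a number")
                continue
            if sr != expected_sr:
                problems.append(f"line {line_no}: audio_fps={sr}, expected {expected_sr}")
    return problems
```

An empty list from `check_audio_meta("./data/meta/audio/train_audio.csv")` means no problems were detected by these two checks.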

Stage2 - Audio-Video SFT

Here we provide an example that uses TAVGBench to prepare video-audio-text triplets for training. The same steps can be adapted to your own datasets.

path	id	relpath	num_frames	height	width	aspect_ratio	fps	resolution	audio_path	audio_fps	text
/path/to/xxx.mp4	xxx	xxx.mp4	160	480	640	0.75	16	307200	/path/to/xxx.wav	16000	yyy

The following script will automatically generate a train_av_sft.csv for configuration:

ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/video"

fmin=10  # minimum number of frames per video
fps=16  # for Wan2.1-1.3B

# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv

# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}

# 2.1 Unify the frame rate of all videos to ${fps} FPS. This modifies the raw videos in place and should output ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps ${fps} --overwrite

# 3.1 Get training meta csv. This should output ${ROOT_META}/train_av_sft.csv
python -m tools.datasets.find_jav_ds tavgbench \
    --meta_src /path/to/TAVGBench/release_captions.txt \
    --meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv \
    --save_file ${ROOT_META}/train_av_sft.csv
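
Unifying the frame rate in step 2.1 changes num_frames for any video whose original FPS differs from the target. As a rough sketch, assuming the conversion preserves clip duration (the actual tool may differ by a frame due to rounding), the expected frame count can be estimated as follows. `frames_after_resample` is a hypothetical helper:

```python
def frames_after_resample(num_frames: int, src_fps: float, dst_fps: float) -> int:
    """Estimate the frame count after a frame-rate change that keeps duration fixed."""
    duration_s = num_frames / src_fps
    return round(duration_s * dst_fps)


# A 10-second clip: 240 frames at 24 FPS becomes about 160 frames at 16 FPS.
print(frames_after_resample(240, 24, 16))
```

This is consistent with the example rows above, where a 240-frame clip at 24 FPS and a 160-frame clip at 16 FPS both last 10 seconds.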

If you have multiple data sources, merge their CSV files into a single one:

python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
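
Conceptually, the merged file is expected to contain the rows of all inputs under one shared header. The sketch below illustrates that idea with the standard library only. It is not the implementation of tools.datasets.datautil, and `merge_meta_files` is a hypothetical name:

```python
import csv


def merge_meta_files(input_paths, output_path):
    """Concatenate CSV files that share the same header and return the row count."""
    header = None
    rows = []
    for path in input_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(f"{path} has a different header than the first file")
            rows.extend(reader)
    if header is None:
        raise ValueError("no input files with a header were given")
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Rejecting mismatched headers early avoids silently producing a training file with misaligned columns.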

Stage3 - Audio-Video DPO

To run DPO, prepare a data pool that is disjoint from the SFT training data and organize it into a train_av_dpo_raw.csv:

path	id	relpath	num_frames	height	width	aspect_ratio	fps	resolution	audio_path	audio_fps	text
/path/to/xxx.mp4	xxx	xxx.mp4	160	480	640	0.75	16	307200	/path/to/xxx.wav	16000	yyy
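
One way to confirm that the DPO pool does not overlap with the SFT data is to compare the id columns of the two meta files. The sketch below assumes comma-delimited files and that id uniquely identifies a sample. Both helper names are hypothetical:

```python
import csv


def read_ids(csv_path, key="id"):
    """Collect the values of one column from a meta file into a set."""
    with open(csv_path, newline="") as f:
        return {row[key] for row in csv.DictReader(f)}


def overlapping_ids(sft_csv, dpo_csv, key="id"):
    """Return the sorted ids that appear in both meta files."""
    return sorted(read_ids(sft_csv, key) & read_ids(dpo_csv, key))
```

An empty result from `overlapping_ids("train_av_sft.csv", "train_av_dpo_raw.csv")` indicates that the two sets share no ids.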

Then, run inference to generate N (e.g., 3) audio-video samples for each prompt input:

model_path="/path/to/av_sft_ckpt"
src_meta_path="/path/to/train_av_dpo_raw.csv"
save_dir="/path/to/dpo_gen"

resolution=480p # or 240p
num_frames=81  # 5s
aspect_ratio="9:16"
num_sample=3

cfg_file="configs/wan2.1/inference/sample.py"

torchrun --standalone --nproc_per_node 8 \
    scripts/inference.py \
    ${cfg_file} \
    --resolution ${resolution} --num-frames ${num_frames} --aspect-ratio ${aspect_ratio} \
    --prompt-path ${src_meta_path} --model-path ${model_path} --num-sample ${num_sample} \
    --save-dir ${save_dir} --verbose 1
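
To estimate the size of the generation job, the clip length and the number of outputs can be computed from the settings above. This sketch assumes the generated clips use the 16 FPS frame rate from Stage2 and that one clip is produced per prompt and sample index. The function names are illustrative only:

```python
def clip_duration_s(num_frames: int, fps: float) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return num_frames / fps


def total_generated(num_prompts: int, num_sample: int) -> int:
    """Number of generated clips when each prompt is sampled num_sample times."""
    return num_prompts * num_sample


print(clip_duration_s(81, 16))   # 5.0625, i.e. roughly 5 seconds
print(total_generated(1000, 3))  # 3000 clips for a pool of 1000 prompts
```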

Next, gather the 1 ground-truth audio-video pair and the 3 generated audio-video pairs for each prompt input into a single meta file:

src_meta_path="/path/to/train_av_dpo_raw.csv"
data_dir="/path/to/dpo_gen"
gen_meta_path="/path/to/train_av_dpo_gen.csv"

# first, create a meta file and extract audios from videos
python -m tools.datasets.convert video ${data_dir} --output ${data_dir}/meta.csv
python -m tools.datasets.datautil ${data_dir}/meta.csv --info --fmin 1
python -m tools.datasets.datautil ${data_dir}/meta_info_fmin1.csv --extract-audio --audio-sr 16000

# second, gather the generated audio-video pairs
python -m tools.datasets.process_dpo \
    --task gather_dpo_gen \
    --src_meta_path ${src_meta_path} \
    --tgt_meta_path ${data_dir}/meta_info_fmin1_au_sr16000.csv \
    --out_meta_path ${gen_meta_path}

Then, score the 1+3 candidates with the modality-aware reward. The results will be saved to ./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv.

gen_meta_path="/path/to/train_av_dpo_gen.csv"
res_dir="./evaluation_results/audio_video_dpo_reward"

METRICS="av-reward"
MAX_AUDIO_LEN_S=5.0

export CUDA_VISIBLE_DEVICES="0"
torchrun --nproc_per_node=1 -m eval.javisbench.main \
    --input_file ${gen_meta_path} \
    --output_file "${res_dir}/avdpo_gen.json" \
    --max_audio_len_s ${MAX_AUDIO_LEN_S} \
    --metrics ${METRICS}

Finally, rank the candidate samples and select the chosen-rejected (or win-lose) sample pairs for DPO training:

src_meta_path="./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv"
out_meta_path="./data/meta/avdpo/train_av_dpo.csv"

python -m tools.datasets.process_dpo \
    --task rank_dpo_pair \
    --src_meta_path ${src_meta_path} \
    --out_meta_path ${out_meta_path}
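
To make the idea of chosen-rejected selection concrete, the sketch below pairs the highest-scored candidate with the lowest-scored one for each prompt. This is one simple strategy shown for illustration only: it collapses the reward into a single scalar, whereas the modality-aware criterion used by tools.datasets.process_dpo may differ. The keys prompt_id, path, and score are hypothetical and do not reflect the actual CSV columns:

```python
from collections import defaultdict


def select_pairs(candidates):
    """Pick a best-vs-worst pair per prompt.

    candidates: iterable of dicts with hypothetical keys
    'prompt_id', 'path', and 'score' (higher is better).
    """
    groups = defaultdict(list)
    for cand in candidates:
        groups[cand["prompt_id"]].append(cand)

    pairs = []
    for prompt_id, group in groups.items():
        ranked = sorted(group, key=lambda c: c["score"], reverse=True)
        chosen, rejected = ranked[0], ranked[-1]
        # Skip prompts with a single candidate or with no score gap.
        if chosen["score"] > rejected["score"]:
            pairs.append(
                {"prompt_id": prompt_id, "chosen": chosen["path"], "rejected": rejected["path"]}
            )
    return pairs
```

Prompts whose candidates all receive the same score are dropped here, since they carry no preference signal under this strategy.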