## Data Preparation for JavisDiT++
### Stage1 - Audio Pretraining
In this stage, we only need audio files to initialize the audio generation capability:
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | audio_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| placeholder.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | placeholder | yyy |
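For reference, a single row of this CSV can be written out by hand as in the sketch below (an illustration only, assuming `pandas` is installed; the commands later in this stage generate the full file automatically). The video-related columns are placeholders because Stage 1 trains on audio alone; `resolution` is simply `height * width` and `aspect_ratio` is `height / width`.
```python
# Illustration of the Stage-1 CSV schema (not part of the official tooling).
# Only audio_path / audio_fps / audio_text carry real data; the video fields are dummies.
import pandas as pd

row = {
    "path": "placeholder.mp4",                 # dummy video path
    "id": "xxx",
    "relpath": "xxx.mp4",
    "num_frames": 240, "height": 480, "width": 640,
    "aspect_ratio": 480 / 640,                 # 0.75
    "fps": 24,
    "resolution": 480 * 640,                   # 307200 pixels
    "audio_path": "/path/to/audios/xxx.wav",   # real training signal
    "audio_fps": 16000,                        # sample rate after step 2.2 below
    "text": "placeholder",
    "audio_text": "yyy",                       # audio caption
}
pd.DataFrame([row]).to_csv("train_audio_example.csv", index=False)
```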
Download the audio datasets (including [AudioCaps](https://drive.google.com/file/d/16J1CVu7EZPD_22FxitZ0TpOd__FwzOmx/view?usp=drive_link), [VGGSound](https://huggingface.co/datasets/Loie/VGGSound), [AudioSet](https://huggingface.co/datasets/agkphysics/AudioSet), [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps), [Clotho](https://zenodo.org/records/3490684), [ESC50](https://github.com/karolpiczak/ESC-50?tab=readme-ov-file#download), [MACS](https://zenodo.org/records/2589280), [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html), [MusicInstrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset), [GTZAN](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification), etc.), and put them into the same folder `/path/to/audios`. Then follow the commands below to automatically generate a `train_audio.csv` for configuration:
```bash
ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"
# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv
# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info
# 2.1 Trim audios to at most 30 seconds. By default this overwrites the raw audio files and outputs ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30
# 2.2 Unify the sample rate to 16 kHz for all audios. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000
# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video
# 3.2 Get training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
--data_root ${ROOT_AUDIO} \
--meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
--save_file ${ROOT_META}/train_audio.csv
```
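Optionally, you can sanity-check the generated `train_audio.csv` before training. The sketch below is not part of the repository tooling and assumes `pandas` and `soundfile` are available; it verifies that every referenced clip exists, is sampled at 16 kHz, and lasts at most 30 seconds.
```python
# Optional sanity check for train_audio.csv (illustrative sketch).
import os
import pandas as pd
import soundfile as sf

df = pd.read_csv("./data/meta/audio/train_audio.csv")
for audio_path in df["audio_path"]:
    assert os.path.exists(audio_path), f"missing audio: {audio_path}"
    info = sf.info(audio_path)
    assert info.samplerate == 16000, f"unexpected sample rate: {audio_path}"
    assert info.frames / info.samplerate <= 30.0, f"clip longer than 30 s: {audio_path}"
print(f"checked {len(df)} rows")
```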
### Stage2 - Audio-Video SFT
Here we provide an example with [TAVGBench](https://github.com/OpenNLPLab/TAVGBench) to prepare video-audio-text triplets for training. You can easily adapt this pipeline to your own datasets.
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |
The following script will automatically generate a `train_av_sft.csv` for configuration:
```bash
ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/video"
fmin=10 # minimum number of frames per video
fps=16 # for Wan2.1-1.3B
# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}
# 2.1 Unify the frame rate to 16 FPS for all videos. This will overwrite the raw videos and output ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps ${fps} --overwrite
# 3.1 Get training meta csv. This should output ${ROOT_META}/train_av_sft.csv
python -m tools.datasets.find_jav_ds tavgbench \
--meta_src /path/to/TAVGBench/release_captions.txt \
--meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv \
--save_file ${ROOT_META}/train_av_sft.csv
```
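As an optional sanity check (again a sketch rather than repository tooling, assuming `pandas` is installed), you can confirm that the resulting `train_av_sft.csv` only contains videos at the unified frame rate with enough frames, and that the referenced files exist:
```python
# Optional sanity check for train_av_sft.csv (illustrative sketch).
import os
import pandas as pd

df = pd.read_csv("./data/meta/video/train_av_sft.csv")
assert (df["fps"] == 16).all(), "some videos were not resampled to 16 FPS"
assert (df["num_frames"] >= 10).all(), "some videos have fewer than fmin=10 frames"
missing = [p for p in list(df["path"]) + list(df["audio_path"]) if not os.path.exists(p)]
print(f"{len(df)} rows, {len(missing)} missing files")
```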
If you have multiple data sources, simply merge the csv files into a single one:
```bash
python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
```
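Conceptually, the merge is just a row-wise concatenation of CSVs that share the same column layout; a rough pandas equivalent (a sketch only, the `datautil` command above is the supported path) looks like this:
```python
# Rough pandas equivalent of merging multiple meta csv files (illustrative sketch).
import pandas as pd

parts = ["ds1.csv", "ds2.csv"]  # per-dataset meta files with identical columns
merged = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)
merged.to_csv("/path/to/output.csv", index=False)
```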
### Stage3 - Audio-Video DPO
To run DPO, you need to prepare a data pool isolated from the SFT training data, organized into a `train_av_dpo_raw.csv`:
| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |
Then, run inference to generate N (e.g., 3) audio-video samples for each prompt input:
```bash
model_path="/path/to/av_sft_ckpt"
src_meta_path="/path/to/train_av_dpo_raw.csv"
save_dir="/path/to/dpo_gen"
resolution=480p # or 240p
num_frames=81 # 5s
aspect_ratio="9:16"
num_sample=3
cfg_file="configs/wan2.1/inference/sample.py"
torchrun --standalone --nproc_per_node 8 \
scripts/inference.py \
${cfg_file} \
--resolution ${resolution} --num-frames ${num_frames} --aspect-ratio ${aspect_ratio} \
--prompt-path ${src_meta_path} --model-path ${model_path} --num-sample ${num_sample} \
--save-dir ${save_dir} --verbose 1
```
Next, gather the 1 ground-truth audio-video pair together with the 3 generated audio-video pairs for each prompt input, and organize them into a single metadata file:
```bash
src_meta_path="/path/to/train_av_dpo_raw.csv"
data_dir="/path/to/dpo_gen"
gen_meta_path="/path/to/train_av_dpo_gen.csv"
# first, create a meta file and extract audios from videos
python -m tools.datasets.convert video ${data_dir} --output ${data_dir}/meta.csv
python -m tools.datasets.datautil ${data_dir}/meta.csv --info --fmin 1
python -m tools.datasets.datautil ${data_dir}/meta_info_fmin1.csv --extract-audio --audio-sr 16000
# second, gather the generated audio-video pairs
python -m tools.datasets.process_dpo \
--task gather_dpo_gen \
--src_meta_path ${src_meta_path} \
--tgt_meta_path ${data_dir}/meta_info_fmin1_au_sr16000.csv \
--out_meta_path ${gen_meta_path}
```
Then, score the 1+3 candidates with the modality-aware reward; the results will be saved to `./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv`:
```bash
gen_meta_path="/path/to/train_av_dpo_gen.csv"
res_dir="./evaluation_results/audio_video_dpo_reward"
METRICS="av-reward"
MAX_AUDIO_LEN_S=5.0
export CUDA_VISIBLE_DEVICES="0"
torchrun --nproc_per_node=1 -m eval.javisbench.main \
--input_file ${gen_meta_path} \
--output_file "${res_dir}/avdpo_gen.json" \
--max_audio_len_s ${MAX_AUDIO_LEN_S} \
--metrics ${METRICS}
```
Finally, rank the generated samples and select the chosen-rejected (or win-lose) sample pairs for DPO training:
```bash
src_meta_path="./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv"
out_meta_path="./data/meta/avdpo/train_av_dpo.csv"
python -m tools.datasets.process_dpo \
--task rank_dpo_pair \
--src_meta_path ${src_meta_path} \
--out_meta_path ${out_meta_path}
```
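For intuition, the pair selection amounts to grouping the 1+3 candidates of each prompt, ranking them by reward, and pairing the highest-scoring candidate (chosen) against the lowest-scoring one (rejected). The sketch below only illustrates this idea; it is not the actual implementation of `tools.datasets.process_dpo`, and the column names `group_id` and `reward` are hypothetical.
```python
# Illustration of chosen/rejected pair selection (NOT the real process_dpo logic;
# "group_id" and "reward" are hypothetical column names).
import pandas as pd

df = pd.read_csv("./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv")
pairs = []
for _, group in df.groupby("group_id"):            # one group = 1 GT + 3 generations per prompt
    ranked = group.sort_values("reward", ascending=False)
    best, worst = ranked.iloc[0], ranked.iloc[-1]   # highest vs. lowest reward
    pairs.append({
        "chosen_path": best["path"], "chosen_audio": best["audio_path"],
        "rejected_path": worst["path"], "rejected_audio": worst["audio_path"],
    })
pd.DataFrame(pairs).to_csv("./data/meta/avdpo/train_av_dpo.csv", index=False)
```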