## Data Preparation for JavisDiT++


### Stage1 - Audio Pretraining


In this stage, we only need audio files to initialize the audio generation capability. In the meta file below, the video-related columns are dummy placeholders (they are filled in automatically by the `--dummy-video` step in the script):

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | audio_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| placeholder.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | placeholder | yyy |

Download the audio files (including [AudioCaps](https://drive.google.com/file/d/16J1CVu7EZPD_22FxitZ0TpOd__FwzOmx/view?usp=drive_link), [VGGSound](https://huggingface.co/datasets/Loie/VGGSound), [AudioSet](https://huggingface.co/datasets/agkphysics/AudioSet), [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps), [Clotho](https://zenodo.org/records/3490684), [ESC50](https://github.com/karolpiczak/ESC-50?tab=readme-ov-file#download), [MACS](https://zenodo.org/records/2589280), [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html), [MusicInstrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset), [GTZAN](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification), etc.) and put them into a single folder `/path/to/audios`. Then follow the commands below to automatically generate a `train_audio.csv` for configuration:

```bash
ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"

# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv

# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info

# 2.1 Trim audios to at most 30 seconds. This overwrites the raw audios by default and outputs ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30

# 2.2 Unify the sample rate to 16 kHz for all audios. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000

# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video

# 3.2 Get the training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
    --data_root ${ROOT_AUDIO} \
    --meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
    --save_file ${ROOT_META}/train_audio.csv
```
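
As a quick sanity check, you can inspect the header and row count of the generated meta file; the columns should match the table above:

```bash
# Optional: verify the generated training meta file
head -n 2 ${ROOT_META}/train_audio.csv   # header + first sample
wc -l ${ROOT_META}/train_audio.csv       # number of samples (+1 header line)
```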

### Stage2 - Audio-Video SFT


Here we provide an example with [TAVGBench](https://github.com/OpenNLPLab/TAVGBench) to prepare video-audio-text triplets for training. You can easily adapt these steps to your own datasets.

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |

The following script will automatically generate a `train_av_sft.csv` for configuration:

```bash
ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/video"

fmin=10  # minimum number of frames per video
fps=16  # for Wan2.1-1.3B

# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv

# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}

# 2.1 Unify the FPS to 16 for all videos. This will overwrite the raw videos and output ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps ${fps} --overwrite

# 3.1 Get the training meta csv. This should output ${ROOT_META}/train_av_sft.csv
python -m tools.datasets.find_jav_ds tavgbench \
    --meta_src /path/to/TAVGBench/release_captions.txt \
    --meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps${fps}.csv \
    --save_file ${ROOT_META}/train_av_sft.csv
```

If you have multiple data sources, simply merge the CSV files into a single one:

```bash
python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
```
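
For example, to merge two SFT shards (the shard names here are illustrative, not files produced by the pipeline):

```bash
python -m tools.datasets.datautil \
    ${ROOT_META}/train_av_sft_tavgbench.csv \
    ${ROOT_META}/train_av_sft_custom.csv \
    --output ${ROOT_META}/train_av_sft.csv
```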

### Stage3 - Audio-Video DPO

To run DPO, you need to prepare a data pool held out from the SFT training data (one way to carve out such a pool is sketched after the table), organized into a `train_av_dpo_raw.csv`:

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 160 | 480 | 640 | 0.75 | 16 | 307200 | /path/to/xxx.wav | 16000 | yyy |
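
If you start from a single merged meta file, a minimal sketch for carving out a held-out pool could look like this (the input file name and the split size are hypothetical, not fixed by the pipeline):

```bash
# A minimal sketch: hold out 1,000 prompts for DPO and keep the rest for SFT
python - <<'EOF'
import pandas as pd

df = pd.read_csv("data/meta/video/train_av_all.csv")  # hypothetical merged pool
dpo = df.sample(n=1000, random_state=0)               # prompts held out for DPO
df.drop(dpo.index).to_csv("data/meta/video/train_av_sft.csv", index=False)
dpo.to_csv("data/meta/avdpo/train_av_dpo_raw.csv", index=False)
EOF
```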

Then, run inference to generate N (e.g., 3) audio-video samples for each prompt input:

```bash
model_path="/path/to/av_sft_ckpt"
src_meta_path="/path/to/train_av_dpo_raw.csv"
save_dir="/path/to/dpo_gen"

resolution=480p # or 240p
num_frames=81  # 5s
aspect_ratio="9:16"
num_sample=3

cfg_file="configs/wan2.1/inference/sample.py"

torchrun --standalone --nproc_per_node 8 \
    scripts/inference.py \
    ${cfg_file} \
    --resolution ${resolution} --num-frames ${num_frames} --aspect-ratio ${aspect_ratio} \
    --prompt-path ${src_meta_path} --model-path ${model_path} --num-sample ${num_sample} \
    --save-dir ${save_dir} --verbose 1
```

Next, for each prompt input, gather the 1 ground-truth audio-video pair together with the 3 generated pairs:

```bash
src_meta_path="/path/to/train_av_dpo_raw.csv"
data_dir="/path/to/dpo_gen"
gen_meta_path="/path/to/train_av_dpo_gen.csv"

# first, create a meta file and extract audios from videos
python -m tools.datasets.convert video ${data_dir} --output ${data_dir}/meta.csv
python -m tools.datasets.datautil ${data_dir}/meta.csv --info --fmin 1
python -m tools.datasets.datautil ${data_dir}/meta_info_fmin1.csv --extract-audio --audio-sr 16000

# second, gather the generated audio-video pairs
python -m tools.datasets.process_dpo \
    --task gather_dpo_gen \
    --src_meta_path ${src_meta_path} \
    --tgt_meta_path ${data_dir}/meta_info_fmin1_au_sr16000.csv \
    --out_meta_path ${gen_meta_path}
```

Then, score the 1+3 candidates with the modality-aware reward; the results will be saved to `./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv`:

```bash
gen_meta_path="/path/to/train_av_dpo_gen.csv"
res_dir="./evaluation_results/audio_video_dpo_reward"

METRICS="av-reward"
MAX_AUDIO_LEN_S=5.0  # 5-second clips, matching num_frames=81 at 16 fps

export CUDA_VISIBLE_DEVICES="0"
torchrun --nproc_per_node=1 -m eval.javisbench.main \
    --input_file ${gen_meta_path} \
    --output_file "${res_dir}/avdpo_gen.json" \
    --max_audio_len_s ${MAX_AUDIO_LEN_S} \
    --metrics ${METRICS}
```
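
Before ranking, you can optionally peek at the per-candidate reward scores (shown only as a quick look; the exact column layout is determined by `eval.javisbench.main`):

```bash
head -n 5 ${res_dir}/avdpo_gen_avreward.csv
```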

Finally, rank the generated samples and select the chosen-rejected (win-lose) sample pairs for DPO training:

```bash
src_meta_path="./evaluation_results/audio_video_dpo_reward/avdpo_gen_avreward.csv"
out_meta_path="./data/meta/avdpo/train_av_dpo.csv"

python -m tools.datasets.process_dpo \
    --task rank_dpo_pair \
    --src_meta_path ${src_meta_path} \
    --out_meta_path ${out_meta_path}
```