## Data Preparation for JavisDiT
### Stage1 - JavisDiT-audio
In this stage, we only need audio files to initialize the audio generation capability:

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | audio_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| placeholder.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | placeholder | yyy |
Download the audios (including [AudioCaps](https://drive.google.com/file/d/16J1CVu7EZPD_22FxitZ0TpOd__FwzOmx/view?usp=drive_link), [VGGSound](https://huggingface.co/datasets/Loie/VGGSound), [AudioSet](https://huggingface.co/datasets/agkphysics/AudioSet), [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps), [Clotho](https://zenodo.org/records/3490684), [ESC50](https://github.com/karolpiczak/ESC-50?tab=readme-ov-file#download), [MACS](https://zenodo.org/records/2589280), [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html), [MusicInstrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset), [GTZAN](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification), etc.), and put them into the same folder `/path/to/audios`. Then follow the commands below to automatically generate a `train_audio.csv` for configuration:
```bash
ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"
# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv
# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info
# 2.1 Trim audios to at most 30 seconds. This overwrites the raw audios by default and outputs ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30
# 2.2 Unify the sample rate of all audios to 16 kHz. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000
# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video
# 3.2 Get the training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
    --data_root ${ROOT_AUDIO} \
    --meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
    --save_file ${ROOT_META}/train_audio.csv
```
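As a quick sanity check (a minimal sketch with pandas; the expected columns follow the table above, and the CSV path assumes the default `ROOT_META` used in the commands), you can verify the generated meta file before training:
```python
import os
import pandas as pd

# Path assumes ROOT_META="./data/meta/audio" as in the commands above
df = pd.read_csv("./data/meta/audio/train_audio.csv")

# Columns expected by the Stage1 schema (see the table above)
required = ["path", "id", "relpath", "num_frames", "height", "width", "aspect_ratio",
            "fps", "resolution", "audio_path", "audio_fps", "text", "audio_text"]
missing = [c for c in required if c not in df.columns]
assert not missing, f"missing columns: {missing}"

# Every audio file should exist and be resampled to 16 kHz
assert df["audio_path"].map(os.path.exists).all(), "some audio files are missing"
assert (df["audio_fps"] == 16000).all(), "unexpected audio sample rate"
print(f"{len(df)} audio samples ready for Stage1 training")
```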
### Stage2 - JavisDiT-prior
As detailed in our [paper](https://arxiv.org/pdf/2503.23377), the prior estimator is trained with a contrastive learning paradigm.
We take the extracted spatio-temporal priors as the **anchor**, treat the paired audio-video samples in the training datasets as **positive samples**, and randomly augment the audio or video to construct asynchronized audio-video pairs as **negative samples**.
In particular, spatial and temporal asynchronization are generated separately.

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | unpaired_audio_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | yyy | /path/to/zzz.wav |
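Conceptually, each row yields one positive and one negative pair for contrastive training. The sketch below is illustrative only (not the repository's data loader) and assumes the meta CSV produced by the steps that follow:
```python
import pandas as pd

# Hypothetical path; produced by the offline generation step below
df = pd.read_csv("./data/meta/prior/train_prior.csv")

for row in df.itertuples():
    anchor_text = row.text                          # caption from which the spatio-temporal prior is extracted
    positive = (row.path, row.audio_path)           # ground-truth synchronized audio-video pair
    negative = (row.path, row.unpaired_audio_path)  # same video paired with an unrelated, generated audio
    # The prior estimator is trained to score `positive` higher than `negative`,
    # in addition to the online augmentations described further below.
```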
#### Ground-truth synchronized audio-video pairs
Follow the instructions in [Stage3](#stage3---javisdit-jav) to read the audio-video information from the training dataset (e.g., [TAVGBench](https://github.com/OpenNLPLab/TAVGBench)).
The resulting basic meta file is referred to as `/path/to/train_jav.csv` below.
#### Offline asynchronized audio generation
Given a synchronized audio-video pair, we efficiently construct an asynchronized counterpart by generating standalone audios with [AudioLDM2](https://github.com/haoheliu/AudioLDM2), without referencing the videos.
The native text description, the native video, and the generated audio jointly constitute an asynchronized (negative) sample for contrastive learning.
The paths of the generated audios are recorded in the `unpaired_audio_path` column.
```bash
ROOT_META="./data/meta/prior"
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
tools/st_prior/gen_unpaired_audios.py \
--input_meta ${ROOT_META}/train_jav.csv \
--output_dir ./data/st_prior/audio/unpaired \
--output_meta ${ROOT_META}/train_prior.csv \
--match_duration
```
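To spot-check the result (a sketch assuming `--match_duration` aligns each generated audio with its video's duration; uses `pandas` and `soundfile`):
```python
import pandas as pd
import soundfile as sf

df = pd.read_csv("./data/meta/prior/train_prior.csv")

for row in df.head(10).itertuples():
    video_sec = row.num_frames / row.fps        # video duration from the meta columns
    info = sf.info(row.unpaired_audio_path)     # generated (unpaired) audio
    audio_sec = info.frames / info.samplerate
    print(f"{row.id}: video {video_sec:.1f}s vs. unpaired audio {audio_sec:.1f}s")
```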
#### Online asynchronized audio-video augmentation
This part is implemented in `javisdit/datasets/augment.py`, where we develop various spatial/temporal augmentations that are applied to video/audio samples independently to construct spatially/temporally asynchronized audio-video pairs.
For implementation details, please refer to our [paper](https://arxiv.org/pdf/2503.23377) and [code](javisdit/datasets/augment.py); here we introduce the data preparation required to perform the corresponding augmentations:
- Auxiliary Video Resource ([SA-V](https://ai.meta.com/datasets/segment-anything-video/))
For video spatial augmentation, an efficient approach is to randomly paste a sounding object's masklet into a video sequence, introducing spatial asynchrony between the video and audio streams (see the sketch after the masklet listing below).
Here we take the training set of [SA-V](https://ai.meta.com/datasets/segment-anything-video/) to collect native object masklets at 6 fps:
```
data/st_prior/video/SA_V/
├── sav_train
│   ├── sav_000
│   ├── sav_001
│   └── sav_002
```
Then, we utilize [GroundedSAM](https://github.com/zhengyuhang123/GroundedSAM.git) to extend 6fps annotations to 24fps masklets:
```bash
mkdir third_party && cd third_party
git clone https://github.com/zhengyuhang123/GroundedSAM.git
cd GroundedSAM
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
wget -P EfficientSAM/ https://github.com/THU-MIG/RepViT/releases/download/v1.0/repvit_sam.pt
cd ../../
ls data/st_prior/video/SA_V/sav_train/sav_*/*.mp4 > data/st_prior/video/SA_V/sa_v_list.txt
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 \
tools/st_prior/get_masklets.py \
--data_path data/st_prior/video/SA_V/sa_v_list.txt \
--output_dir data/st_prior/video/SA_V/crops
ls data/st_prior/video/SA_V/crops/*.json > data/st_prior/video/SA_V/crops/pool_list.txt
```
The extracted masklets will be stored as:
```
data/st_prior/video/SA_V/crops/
├── pool_list.txt
├── sav_000001_mask_000.mp4
├── sav_000001_masklet_000.mp4
├── sav_000001_meta_000.json
├── sav_000002_mask_000.mp4
├── sav_000002_mask_001.mp4
├── sav_000002_masklet_000.mp4
├── sav_000002_masklet_001.mp4
├── sav_000002_meta_000.json
├── sav_000002_meta_001.json
└── ...
```
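As a rough illustration of the video spatial augmentation mentioned above (not the implementation in `javisdit/datasets/augment.py`; the array shapes and decoding of the mask/masklet files are assumptions), pasting a sounding-object crop onto a clip could look like:
```python
import numpy as np

def paste_masklet(frames: np.ndarray, crop: np.ndarray, mask: np.ndarray,
                  top: int, left: int) -> np.ndarray:
    """Overlay an object crop onto a video clip to break audio-video spatial consistency.

    frames: (T, H, W, 3) uint8 frames of the original clip
    crop:   (T, h, w, 3) uint8 object crop (e.g. decoded from sav_*_masklet_*.mp4)
    mask:   (T, h, w)    binary mask       (e.g. decoded from sav_*_mask_*.mp4)
    """
    out = frames.copy()
    t, h, w = mask.shape
    region = out[:t, top:top + h, left:left + w]  # view into the output frames
    m = mask[..., None].astype(bool)              # broadcast over the RGB channels
    region[:] = np.where(m, crop, region)         # keep background, paste the object
    return out
```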
- Auxiliary Audio Resource ([AudioSep](https://github.com/Audio-AGI/AudioSep))
After separating audio sources from the original audio files, we can apply arbitrary addition and deletion operations on the separated tracks to introduce spatial asynchrony between video and audio pairs (see the sketch after the commands below):
```bash
cd third_party
git clone https://github.com/Audio-AGI/AudioSep.git
cd ..  # return to the repository root so the relative paths below resolve
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 \
tools/st_prior/sep_audios.py \
--audio_path /path/to/TAVGBench \
--output_path ./data/st_prior/audio/TAVGBench
ls data/st_prior/audio/TAVGBench/*.wav > data/st_prior/audio/TAVGBench/pool_list.txt
```
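The sketch below illustrates the addition/deletion idea on separated tracks (illustrative only; the actual augmentation lives in `javisdit/datasets/augment.py`, and mono 16 kHz audio is assumed):
```python
import numpy as np
import soundfile as sf

def add_source(mix_path: str, extra_path: str, out_path: str, gain: float = 0.7):
    """Mix an unrelated separated source into an existing track (spatial 'addition')."""
    mix, sr = sf.read(mix_path)
    extra, sr2 = sf.read(extra_path)
    assert sr == sr2, "sample rates must match (16 kHz after preprocessing)"
    n = min(len(mix), len(extra))
    sf.write(out_path, np.clip(mix[:n] + gain * extra[:n], -1.0, 1.0), sr)

def remove_source(mix_path: str, source_path: str, out_path: str):
    """Subtract a separated source from the original mixture (spatial 'deletion')."""
    mix, sr = sf.read(mix_path)
    src, _ = sf.read(source_path)
    n = min(len(mix), len(src))
    sf.write(out_path, mix[:n] - src[:n], sr)
```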
### Stage3 - JavisDiT-jav
Here we provide an example with [TAVGBench](https://github.com/OpenNLPLab/TAVGBench) to prepare video-audio-text triplets for training. You can easily adapt the pipeline to your own datasets.

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | yyy |
The following script will automatically generate a `train_jav.csv` for configuration:
```bash
ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/TAVGBench"
fmin=10 # minimal number of frames per video
# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}
# 2.1 Unify the frame rate of all videos to 24 fps. This will overwrite the raw videos and output ${ROOT_META}/meta_info_fmin${fmin}_fps24.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps 24 --overwrite
# 2.2 Extract audios from videos and resample them to 16 kHz. This should output ${ROOT_META}/meta_info_fmin${fmin}_fps24_au_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}_fps24.csv --extract-audio --audio-sr 16000
# 3.1 Get training meta csv. This should output ${ROOT_META}/train_jav.csv
python -m tools.datasets.find_jav_ds tavgbench \
--meta_src /path/to/TAVGBench/release_captions.txt \
--meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps24_au_sr16000.csv \
--save_file ${ROOT_META}/train_jav.csv
```
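As with Stage1, a quick check of the resulting meta file (a pandas sketch; columns follow the table above) can confirm the unified frame rate and sample rate before training:
```python
import pandas as pd

df = pd.read_csv("./data/meta/TAVGBench/train_jav.csv")
assert (df["fps"] == 24).all(), "videos should be unified to 24 fps"
assert (df["audio_fps"] == 16000).all(), "audios should be resampled to 16 kHz"
assert df["text"].notna().all(), "every sample needs a caption"
print(f"{len(df)} video-audio-text triplets ready for Stage3 training")
```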
If you have multiple data sources, simply merge the CSV files into a single one:
```bash
python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
```
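Roughly the same result can be obtained with pandas (assuming the CSVs share identical columns):
```python
import pandas as pd

files = ["ds1.csv", "ds2.csv"]  # your per-dataset meta files
merged = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
merged.to_csv("/path/to/output.csv", index=False)
```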