## Data Preparation for JavisDiT
### Stage1 - JavisDiT-audio
In this stage, we only need audio files to initialize the audio generation capability:

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | audio_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| placeholder.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | placeholder | yyy |
Download the audios (including [AudioCaps](https://drive.google.com/file/d/16J1CVu7EZPD_22FxitZ0TpOd__FwzOmx/view?usp=drive_link), [VGGSound](https://huggingface.co/datasets/Loie/VGGSound), [AudioSet](https://huggingface.co/datasets/agkphysics/AudioSet), [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps), [Clotho](https://zenodo.org/records/3490684), [ESC50](https://github.com/karolpiczak/ESC-50?tab=readme-ov-file#download), [MACS](https://zenodo.org/records/2589280), [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html), [MusicInstrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset), [GTZAN](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification), etc.), and put them into the same folder `/path/to/audios`. Then follow the commands below to automatically generate a `train_audio.csv` for configuration:
```bash
ROOT_AUDIO="/path/to/audios"
ROOT_META="./data/meta/audio"
# 1.1 Create a meta file from a unified audio folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert audio ${ROOT_AUDIO} --output ${ROOT_META}/meta.csv
# 1.2 Get audio information. This should output ${ROOT_META}/meta_ainfo.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --audio-info
# 2.1 Trim audios to at most 30 seconds. This overwrites the raw audios by default and outputs ${ROOT_META}/meta_ainfo_trim30s.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo.csv --trim-audio 30
# 2.2 Unify the sample rate of all audios to 16 kHz. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s.csv --resample-audio --audio-sr 16000
# 3.1 Set dummy videos. This should output ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_ainfo_trim30s_sr16000.csv --dummy-video
# 3.2 Get the training meta csv. This should output ${ROOT_META}/train_audio.csv
python -m tools.datasets.find_audio_ds all \
    --data_root ${ROOT_AUDIO} \
    --meta_file ${ROOT_META}/meta_ainfo_trim30s_sr16000_dummy_videos.csv \
    --save_file ${ROOT_META}/train_audio.csv
```
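As a quick sanity check (a minimal sketch with pandas; the expected columns follow the table above, and the CSV path assumes the default `ROOT_META` used in the commands), you can verify the generated meta file before training:
```python
import os
import pandas as pd

# Path assumes ROOT_META="./data/meta/audio" as in the commands above
df = pd.read_csv("./data/meta/audio/train_audio.csv")

# Columns expected by the Stage1 schema (see the table above)
required = ["path", "id", "relpath", "num_frames", "height", "width", "aspect_ratio",
            "fps", "resolution", "audio_path", "audio_fps", "text", "audio_text"]
missing = [c for c in required if c not in df.columns]
assert not missing, f"missing columns: {missing}"

# Every audio file should exist and be resampled to 16 kHz
assert df["audio_path"].map(os.path.exists).all(), "some audio files are missing"
assert (df["audio_fps"] == 16000).all(), "unexpected audio sample rate"
print(f"{len(df)} audio samples ready for Stage1 training")
```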
### Stage2 - JavisDiT-prior
As detailed in our [paper](https://arxiv.org/pdf/2503.23377), the prior estimator is trained with a contrastive learning paradigm.
We take the extracted spatio-temporal priors as the **anchor**, treat the paired audio-video samples in the training datasets as **positive samples**, and randomly augment the audio or video to construct asynchronized audio-video pairs as **negative samples**.
In particular, spatial and temporal asynchronization are generated separately.

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text | unpaired_audio_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | yyy | /path/to/zzz.wav |
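Conceptually, each row yields one positive and one negative pair for contrastive training. The sketch below is illustrative only (not the repository's data loader) and assumes the meta CSV produced by the steps that follow:
```python
import pandas as pd

# Hypothetical path; produced by the offline generation step below
df = pd.read_csv("./data/meta/prior/train_prior.csv")

for row in df.itertuples():
    anchor_text = row.text                          # caption from which the spatio-temporal prior is extracted
    positive = (row.path, row.audio_path)           # ground-truth synchronized audio-video pair
    negative = (row.path, row.unpaired_audio_path)  # same video paired with an unrelated, generated audio
    # The prior estimator is trained to score `positive` higher than `negative`,
    # in addition to the online augmentations described further below.
```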
#### Ground-truth synchronized audio-video pairs
Follow the instructions in [Stage3](#stage3---javisdit-jav) to read the audio-video information from the training dataset (e.g., [TAVGBench](https://github.com/OpenNLPLab/TAVGBench)).
The resulting basic meta file is referred to as `/path/to/train_jav.csv` below.
#### Offline asynchronized audio generation
Given a synchronized audio-video pair, we efficiently construct an asynchronized counterpart by generating standalone audios with [AudioLDM2](https://github.com/haoheliu/AudioLDM2), without referencing the videos.
The native text description, the native video, and the generated audio jointly constitute an asynchronized (negative) sample for contrastive learning.
The paths of the generated audios are recorded in the `unpaired_audio_path` column.
```bash
ROOT_META="./data/meta/prior"
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
tools/st_prior/gen_unpaired_audios.py \
--input_meta ${ROOT_META}/train_jav.csv \
--output_dir ./data/st_prior/audio/unpaired \
--output_meta ${ROOT_META}/train_prior.csv \
--match_duration
```
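To spot-check the result (a sketch assuming `--match_duration` aligns each generated audio with its video's duration; uses `pandas` and `soundfile`):
```python
import pandas as pd
import soundfile as sf

df = pd.read_csv("./data/meta/prior/train_prior.csv")

for row in df.head(10).itertuples():
    video_sec = row.num_frames / row.fps        # video duration from the meta columns
    info = sf.info(row.unpaired_audio_path)     # generated (unpaired) audio
    audio_sec = info.frames / info.samplerate
    print(f"{row.id}: video {video_sec:.1f}s vs. unpaired audio {audio_sec:.1f}s")
```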
#### Online asynchronized audio-video augmentation
This part is implemented in `javisdit/datasets/augment.py`, where we develop various spatial/temporal augmentations that are applied to video/audio samples independently to construct spatially/temporally asynchronized audio-video pairs.
For implementation details, please refer to our [paper](https://arxiv.org/pdf/2503.23377) and [code](javisdit/datasets/augment.py); here we introduce the data preparation required to perform the corresponding augmentations:
- Auxiliary Video Resource ([SA-V](https://ai.meta.com/datasets/segment-anything-video/))
For video spatial augmentation, an efficient approach is to randomly paste a sounding object's masklet into a video sequence, introducing spatial asynchrony between the video and audio streams (see the sketch after the masklet listing below).
Here we take the training set of [SA-V](https://ai.meta.com/datasets/segment-anything-video/) to collect native object masklets at 6 fps:
```
data/st_prior/video/SA_V/
├── sav_train
│   ├── sav_000
│   ├── sav_001
│   └── sav_002
```
Then, we utilize [GroundedSAM](https://github.com/zhengyuhang123/GroundedSAM.git) to extend 6fps annotations to 24fps masklets:
```bash
mkdir third_party && cd third_party
git clone https://github.com/zhengyuhang123/GroundedSAM.git
cd GroundedSAM
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
wget -P EfficientSAM/ https://github.com/THU-MIG/RepViT/releases/download/v1.0/repvit_sam.pt
cd ../../
ls data/st_prior/video/SA_V/sav_train/sav_*/*.mp4 > data/st_prior/video/SA_V/sa_v_list.txt
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 \
tools/st_prior/get_masklets.py \
--data_path data/st_prior/video/SA_V/sa_v_list.txt \
--output_dir data/st_prior/video/SA_V/crops
ls data/st_prior/video/SA_V/crops/*.json > data/st_prior/video/SA_V/crops/pool_list.txt
```
The extracted masklets will be stored as:
```
data/st_prior/video/SA_V/crops/
├── pool_list.txt
├── sav_000001_mask_000.mp4
├── sav_000001_masklet_000.mp4
├── sav_000001_meta_000.json
├── sav_000002_mask_000.mp4
├── sav_000002_mask_001.mp4
├── sav_000002_masklet_000.mp4
├── sav_000002_masklet_001.mp4
├── sav_000002_meta_000.json
├── sav_000002_meta_001.json
└── ...
```
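As a rough illustration of the video spatial augmentation mentioned above (not the implementation in `javisdit/datasets/augment.py`; the array shapes and decoding of the mask/masklet files are assumptions), pasting a sounding-object crop onto a clip could look like:
```python
import numpy as np

def paste_masklet(frames: np.ndarray, crop: np.ndarray, mask: np.ndarray,
                  top: int, left: int) -> np.ndarray:
    """Overlay an object crop onto a video clip to break audio-video spatial consistency.

    frames: (T, H, W, 3) uint8 frames of the original clip
    crop:   (T, h, w, 3) uint8 object crop (e.g. decoded from sav_*_masklet_*.mp4)
    mask:   (T, h, w)    binary mask       (e.g. decoded from sav_*_mask_*.mp4)
    """
    out = frames.copy()
    t, h, w = mask.shape
    region = out[:t, top:top + h, left:left + w]  # view into the output frames
    m = mask[..., None].astype(bool)              # broadcast over the RGB channels
    region[:] = np.where(m, crop, region)         # keep background, paste the object
    return out
```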
- Auxiliary Audio Resource ([AudioSep](https://github.com/Audio-AGI/AudioSep))
After separating audio sources from the original audio files, we can apply arbitrary addition and deletion operations on the separated tracks to introduce spatial asynchrony between video and audio pairs (see the sketch after the commands below):
```bash
cd third_party
git clone https://github.com/Audio-AGI/AudioSep.git
cd ..  # return to the repository root so the relative paths below resolve
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 \
tools/st_prior/sep_audios.py \
--audio_path /path/to/TAVGBench \
--output_path ./data/st_prior/audio/TAVGBench
ls data/st_prior/audio/TAVGBench/*.wav > data/st_prior/audio/TAVGBench/pool_list.txt
```
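The sketch below illustrates the addition/deletion idea on separated tracks (illustrative only; the actual augmentation lives in `javisdit/datasets/augment.py`, and mono 16 kHz audio is assumed):
```python
import numpy as np
import soundfile as sf

def add_source(mix_path: str, extra_path: str, out_path: str, gain: float = 0.7):
    """Mix an unrelated separated source into an existing track (spatial 'addition')."""
    mix, sr = sf.read(mix_path)
    extra, sr2 = sf.read(extra_path)
    assert sr == sr2, "sample rates must match (16 kHz after preprocessing)"
    n = min(len(mix), len(extra))
    sf.write(out_path, np.clip(mix[:n] + gain * extra[:n], -1.0, 1.0), sr)

def remove_source(mix_path: str, source_path: str, out_path: str):
    """Subtract a separated source from the original mixture (spatial 'deletion')."""
    mix, sr = sf.read(mix_path)
    src, _ = sf.read(source_path)
    n = min(len(mix), len(src))
    sf.write(out_path, mix[:n] - src[:n], sr)
```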
### Stage3 - JavisDiT-jav
Here we provide an example with [TAVGBench](https://github.com/OpenNLPLab/TAVGBench) to prepare video-audio-text triplets for training. You can easily adapt the pipeline to your own datasets.

| path | id | relpath | num_frames | height | width | aspect_ratio | fps | resolution | audio_path | audio_fps | text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---|
| /path/to/xxx.mp4 | xxx | xxx.mp4 | 240 | 480 | 640 | 0.75 | 24 | 307200 | /path/to/xxx.wav | 16000 | yyy |
The following script will automatically generate a `train_jav.csv` for configuration:
```bash
ROOT_VIDEO="/path/to/videos"
ROOT_META="./data/meta/TAVGBench"
fmin=10 # minimal number of frames per video
# 1.1 Create a meta file from a video folder. This should output ${ROOT_META}/meta.csv
python -m tools.datasets.convert video ${ROOT_VIDEO} --output ${ROOT_META}/meta.csv
# 1.2 Get video information and remove broken videos. This should output ${ROOT_META}/meta_info_fmin${fmin}.csv
python -m tools.datasets.datautil ${ROOT_META}/meta.csv --info --fmin ${fmin}
# 2.1 Unify the frame rate of all videos to 24 fps. This will overwrite the raw videos and output ${ROOT_META}/meta_info_fmin${fmin}_fps24.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}.csv --uni-fps 24 --overwrite
# 2.2 Extract audios from videos and resample them to 16 kHz. This should output ${ROOT_META}/meta_info_fmin${fmin}_fps24_au_sr16000.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_info_fmin${fmin}_fps24.csv --extract-audio --audio-sr 16000
# 3.1 Get training meta csv. This should output ${ROOT_META}/train_jav.csv
python -m tools.datasets.find_jav_ds tavgbench \
--meta_src /path/to/TAVGBench/release_captions.txt \
--meta_file ${ROOT_META}/meta_info_fmin${fmin}_fps24_au_sr16000.csv \
--save_file ${ROOT_META}/train_jav.csv
```
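As with Stage1, a quick check of the resulting meta file (a pandas sketch; columns follow the table above) can confirm the unified frame rate and sample rate before training:
```python
import pandas as pd

df = pd.read_csv("./data/meta/TAVGBench/train_jav.csv")
assert (df["fps"] == 24).all(), "videos should be unified to 24 fps"
assert (df["audio_fps"] == 16000).all(), "audios should be resampled to 16 kHz"
assert df["text"].notna().all(), "every sample needs a caption"
print(f"{len(df)} video-audio-text triplets ready for Stage3 training")
```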
If you have multiple data sources, simply merge the CSV files into a single one:
```bash
python -m tools.datasets.datautil ds1.csv ds2.csv ... --output /path/to/output.csv
```
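Roughly the same result can be obtained with pandas (assuming the CSVs share identical columns):
```python
import pandas as pd

files = ["ds1.csv", "ds2.csv"]  # your per-dataset meta files
merged = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
merged.to_csv("/path/to/output.csv", index=False)
```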