---
title: SongFormer
emoji: 🎡
colorFrom: blue
colorTo: indigo
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - music-structure-annotation
  - transformer
short_description: State-of-the-art music analysis with multi-scale datasets
fullWidth: true
---

<p align="center">
  <img src="figs/logo.png" width="50%" />
</p>


# SONGFORMER: SCALING MUSIC STRUCTURE ANALYSIS WITH HETEROGENEOUS SUPERVISION

![Python](https://img.shields.io/badge/Python-3.10-brightgreen)  
![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightblue)  
[![arXiv](https://img.shields.io/badge/arXiv-com.svg?logo=arXiv)]()  
[![GitHub](https://img.shields.io/badge/GitHub-SongFormer-black)](https://github.com/ASLP-lab/SongFormer)  
[![HuggingFace Space](https://img.shields.io/badge/HuggingFace-space-yellow)](https://huggingface.co/spaces/ASLP-lab/SongFormer)  
[![HuggingFace Model](https://img.shields.io/badge/HuggingFace-model-blue)](https://huggingface.co/ASLP-lab/SongFormer)  
[![Dataset SongFormDB](https://img.shields.io/badge/HF%20Dataset-SongFormDB-green)](https://huggingface.co/datasets/ASLP-lab/SongFormDB)  
[![Dataset SongFormBench](https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange)](https://huggingface.co/datasets/ASLP-lab/SongFormBench)
[![Discord](https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white)](https://discord.gg/rwcqh7Em)
[![lab](https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey)](http://www.npu-aslp.org/)

Chunbo Hao<sup>&ast;</sup>, Ruibin Yuan<sup>&ast;</sup>, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie<sup>&dagger;</sup>


----


SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision, accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench to foster fair and reproducible research.

![](figs/songformer.png)

## News and Updates

## 📋 To-Do List

- [x] Complete and push inference code to GitHub
- [x] Upload model checkpoint(s) to Hugging Face Hub
- [ ] Upload the paper to arXiv
- [x] Fix readme
- [ ] Deploy an out-of-the-box inference version on Hugging Face (via Inference API or Spaces)
- [ ] Publish the package to PyPI for easy installation via `pip`
- [ ] Open-source evaluation code
- [ ] Open-source training code

## Installation

### Setting up Python Environment

```bash
git clone https://github.com/ASLP-lab/SongFormer.git

# Get MuQ and MusicFM source code
git submodule update --init --recursive

conda create -n songformer python=3.10 -y
conda activate songformer
```

For users in mainland China, you may need to set up a pip mirror:

```bash
pip config set global.index-url https://pypi.mirrors.ustc.edu.cn/simple
```

Install dependencies:

```bash
pip install -r requirements.txt
```

We tested this on Ubuntu 22.04.1 LTS, where it works normally. If installation fails, you may need to relax the version constraints in `requirements.txt`.

### Download Pre-trained Models

```bash
cd src/SongFormer
# For users in mainland China: follow the instructions inside this file to download via hf-mirror.com
python utils/fetch_pretrained.py
```

After downloading, verify that the md5sum values listed in `src/SongFormer/ckpts/MusicFM/md5sum.txt` match those of the downloaded files:

```bash
md5sum ckpts/MusicFM/msd_stats.json
md5sum ckpts/MusicFM/pretrained_msd.pt
md5sum ckpts/SongFormer.safetensors
# md5sum ckpts/SongFormer.pt
```
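If the manifest uses the standard `md5sum` "HASH  FILENAME" format, you can let `md5sum -c` do the comparison for you instead of eyeballing hashes. A minimal self-contained sketch of the pattern, using a dummy file rather than the real checkpoints:

```shell
# Create a dummy file standing in for a downloaded checkpoint
printf 'dummy checkpoint data' > demo.bin

# Record its checksum in the standard "HASH  FILENAME" manifest format
md5sum demo.bin > manifest.txt

# -c re-hashes each listed file and prints OK or FAILED per entry
md5sum -c manifest.txt
```

For the real checkpoints, run `md5sum -c` against the shipped `md5sum.txt` from the directory the paths in that manifest are relative to.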

## Inference

### 1. One-Click Inference with HuggingFace Space (coming soon)

Available at: [https://huggingface.co/spaces/ASLP-lab/SongFormer](https://huggingface.co/spaces/ASLP-lab/SongFormer)

### 2. Gradio App

First, cd to the project root directory and activate the environment:

```bash
conda activate songformer
```

You can modify the server port and listening address in the last line of `app.py` according to your preference.

> If you're using an HTTP proxy, please ensure you include:
>
> ```bash
> export no_proxy="localhost, 127.0.0.1, ::1"
> export NO_PROXY="localhost, 127.0.0.1, ::1"
> ```
>
> Otherwise, Gradio may incorrectly conclude that the service has not started and exit immediately.

The first time you run `app.py`, it connects to Hugging Face to download the MuQ-related weights. We recommend creating an empty folder in a suitable location and pointing `HF_HOME` at it with `export HF_HOME=XXX`, so the cache lands there for easy cleanup and transfer.

For users in mainland China, you may also need `export HF_ENDPOINT=https://hf-mirror.com`. For details, refer to https://hf-mirror.com/

```bash
python app.py
```
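Putting the environment hints above together, a typical launch session might look like the following (the cache path is illustrative; adjust it to your setup):

```shell
# Store the Hugging Face cache in a dedicated, easy-to-clean folder (illustrative path)
export HF_HOME=/data/hf_cache

# If you use an HTTP proxy, keep local addresses off it so Gradio's startup check works
export no_proxy="localhost, 127.0.0.1, ::1"
export NO_PROXY="localhost, 127.0.0.1, ::1"

# Mainland-China mirror, only if needed
# export HF_ENDPOINT=https://hf-mirror.com

python app.py
```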

### 3. Python Code

You can refer to the file `src/SongFormer/infer/infer.py`. The corresponding execution script is located at `src/SongFormer/infer.sh`. This is a ready-to-use, single-machine, multi-process annotation script.

Below are some configurable parameters from the `src/SongFormer/infer.sh` script. You can set `CUDA_VISIBLE_DEVICES` to specify which GPUs to use:

```bash
-i              # Input SCP file path; each line contains the absolute path to one audio file
-o              # Output directory for annotation results
--model         # Annotation model; the default is 'SongFormer', change it if using a fine-tuned model
--checkpoint    # Path to the model checkpoint file
--config_path   # Path to the configuration file
-gn             # Total number of GPUs to use; should match the number of GPUs in CUDA_VISIBLE_DEVICES
-tn             # Number of processes to run per GPU
```

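The `-i` option expects an SCP file with one absolute audio path per line. A minimal Python sketch that builds such a file from a folder of audio (the extension list is an illustrative assumption, not something the script mandates):

```python
from pathlib import Path


def write_scp(audio_dir: str, scp_path: str, exts=(".wav", ".mp3", ".flac")) -> int:
    """Write one absolute audio path per line to scp_path; return the file count."""
    paths = sorted(
        p.resolve()
        for p in Path(audio_dir).rglob("*")
        if p.suffix.lower() in exts
    )
    with open(scp_path, "w") as f:
        for p in paths:
            f.write(f"{p}\n")
    return len(paths)
```

The resulting file can then be passed directly as the `-i` argument of `infer.sh`.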

### 4. CLI Inference

Coming soon

### 5. Pitfall

- You may need to modify line 121 in `src/third_party/musicfm/model/musicfm_25hz.py` to:
`S = torch.load(model_path, weights_only=False)["state_dict"]`

## Training

## Citation

If our work and codebase are useful to you, please cite us as:

````
coming soon
````
## License

Our code is released under the CC BY 4.0 license.

## Contact Us


<p align="center">
    <a href="http://www.nwpu-aslp.org/">
        <img src="figs/aslp.png" width="400"/>
    </a>
</p>