arxiv:2509.00683

PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

Published on Aug 31
Authors:
Wen Wu, et al.

Abstract

While recent work on controllable text-to-audio (TTA) generation has achieved fine-grained control through timestamp conditioning, its scope remains limited by audio quality and input format. These models often suffer from poor audio quality because they rely solely on synthetic data rather than real recordings. Moreover, some models are constrained to a closed vocabulary of sound events, preventing them from controlling audio generation for open-ended, free-text queries. This paper introduces PicoAudio2, a framework that advances temporally controllable TTA by mitigating these data and architectural limitations. Specifically, we use a grounding model to annotate event timestamps in real audio-text datasets, curating temporally strong real data that complements the simulated data from existing work; the model is trained on the combination of the two. We further propose an enhanced architecture that integrates fine-grained information from a timestamp matrix with the coarse-grained free-text input. Experiments show that PicoAudio2 achieves superior temporal controllability and audio quality.
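To make the timestamp-matrix conditioning described in the abstract concrete, below is a minimal sketch of how grounded event annotations might be rasterized into a binary event-by-frame matrix. This is an illustration under stated assumptions, not PicoAudio2's actual code: the function name, the (label, onset, offset) annotation format, and the 25 Hz frame rate are all hypothetical choices for the example.

```python
import numpy as np

def build_timestamp_matrix(events, duration_s, frame_rate_hz=25):
    """Rasterize (label, onset_s, offset_s) annotations into a binary
    event-by-frame matrix with one row per distinct event label.

    Illustrative sketch only: the annotation format and frame rate are
    assumptions, not details taken from the paper.
    """
    labels = sorted({label for label, _, _ in events})
    n_frames = int(round(duration_s * frame_rate_hz))
    matrix = np.zeros((len(labels), n_frames), dtype=np.float32)
    row = {label: i for i, label in enumerate(labels)}
    for label, onset, offset in events:
        start = int(onset * frame_rate_hz)
        end = min(int(offset * frame_rate_hz), n_frames)
        matrix[row[label], start:end] = 1.0  # mark frames where the event is active
    return labels, matrix

# Example: "a dog barks twice, then a car horn", with grounded timestamps.
events = [("dog_bark", 0.5, 1.2), ("dog_bark", 2.0, 2.6), ("car_horn", 3.0, 4.0)]
labels, M = build_timestamp_matrix(events, duration_s=5.0)
print(labels, M.shape)  # ['car_horn', 'dog_bark'] (2, 125)
```

In the architecture the abstract describes, such a fine-grained matrix would be fused with the coarse-grained free-text embedding as a conditioning signal; the sketch only covers constructing the matrix itself.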
