DualSpec.github.io

DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model

Abstract

Text-to-audio (TTA) generation, which produces audio signals from textual descriptions, has received considerable attention in recent years. However, recent work has focused only on text-to-monaural audio. Spatial audio provides a more immersive auditory experience than monaural audio, e.g., in virtual reality. To address this gap, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. In particular, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is the Mel spectrogram, which is good for improving synthesis quality, and the other is the short-time Fourier transform spectrogram, which is good for improving azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts, used to train the VAEs and the diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
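The abstract mentions metrics that quantify azimuth errors but does not define them on this page. As a minimal sketch only, and not the metric used by DualSpec, one common way to compare a predicted and a target azimuth is the smallest absolute angle between them, which handles the wrap-around at 360°. The function names `azimuth_error_deg` and `mean_azimuth_error_deg` are hypothetical.

```python
from typing import Sequence


def azimuth_error_deg(predicted_deg: float, target_deg: float) -> float:
    """Smallest absolute angle between two azimuths, in degrees (0 to 180).

    Assumption: both angles are given in degrees on the same circular scale.
    This is an illustrative definition, not the paper's metric.
    """
    diff = abs(predicted_deg - target_deg) % 360.0
    # Going the other way around the circle may be shorter.
    return min(diff, 360.0 - diff)


def mean_azimuth_error_deg(
    predicted: Sequence[float], target: Sequence[float]
) -> float:
    """Average circular azimuth error over paired predictions and targets."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [azimuth_error_deg(p, t) for p, t in zip(predicted, target)]
    return sum(errors) / len(errors)
```

With this definition, a prediction of 350° against a target of 10° counts as a 20° error rather than 340°.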

Demos

We showcase DualSpec's spatial audio generation capabilities through three direction descriptors: 1) Direction-of-Arrival (DOA), 2) Clock Position, and 3) General Description.
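To relate the first two descriptors, a clock position can be converted to an angle, since each hour on a clock face spans 30°. The sketch below assumes that 12 o'clock is straight ahead and that angles increase clockwise. This page does not state the angle convention DualSpec uses for DOA, so treat the mapping as an illustration only. The helper name `clock_to_azimuth_deg` is hypothetical.

```python
def clock_to_azimuth_deg(hour: int) -> int:
    """Convert a clock position (1 to 12) to an angle in degrees.

    Assumptions (not confirmed by the source): 12 o'clock is 0 degrees
    (straight ahead) and the angle grows clockwise, so 3 o'clock is 90.
    """
    if not 1 <= hour <= 12:
        raise ValueError("hour must be between 1 and 12")
    # Each hour mark is 360 / 12 = 30 degrees; 12 o'clock wraps to 0.
    return (hour % 12) * 30
```

Under these assumptions, 12 o'clock maps to 0°, 3 o'clock to 90°, and 7 o'clock to 210°.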

Direction-of-Arrival (DOA)

woodwind & DOA 60°

walk and footsteps & DOA 270°

dog bark & DOA 180°

toilet flush & DOA 300°

piano & DOA 0°

rooster & DOA 210°

drinking & DOA 60°

acoustic guitar & DOA 150°

glass breaking & DOA 180°

Clock Position

chirping birds & 12 o'clock direction

wind chime & 3 o'clock direction

accelerating and revving & 7 o'clock direction

clapping & 2 o'clock direction

baby crying & 10 o'clock direction

dog bark & 8 o'clock direction

guitar & 6 o'clock direction

electric guitar & 12 o'clock direction

train & 5 o'clock direction

General Description

bathtub washing & forward left, slight angle

woodwind & positioned to the rear on the left side

applause & rear, slightly to the right

street music & rear right

rooster & directly behind

cowbell & toward the front left

chirping birds & straight ahead

pig & right rear

knock & to the right front, slightly moving back