Abstract:
Audio super-resolution is the task of constructing a high-resolution (HR) audio signal from a low-resolution (LR) signal that
contains only a fraction of the original waveform samples. Previous methods based on convolutional neural networks trained with a mean
squared error objective achieve relatively low performance, while adversarial generative models are difficult to
train and tune. Recently, normalizing flows have attracted a lot of attention for their high performance, simple training
and fast inference. In this paper, we propose WSRGlow, a Glow-based waveform generative model to perform audio
super-resolution. Specifically, 1) we integrate WaveNet and Glow to directly maximize the exact likelihood of the target
HR audio conditioned on LR information; and 2) to exploit the information in the LR audio, we propose an
LR audio encoder and an STFT encoder, which encode the LR information in the time domain and the frequency domain,
respectively. Experimental results show that the proposed model is easier to train and outperforms previous
work in terms of both objective and perceptual quality. Audio samples are available at
https://zkx06111.github.io/wsrglow/.
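As a reading aid, the "exact likelihood" that a conditional normalizing flow maximizes can be written with the standard change-of-variables formula. This is a generic sketch, not the paper's own notation: the symbols x (HR waveform), c (conditioning features derived from the LR audio), f_theta (the invertible flow), and p_Z (a simple prior such as a standard Gaussian) are assumptions introduced here.

```latex
\log p_\theta(x \mid c)
  = \log p_Z\bigl(f_\theta(x; c)\bigr)
  + \log \left| \det \frac{\partial f_\theta(x; c)}{\partial x} \right|
```

Because f_theta is invertible, the HR waveform can be sampled by drawing z from p_Z and computing the inverse of f_theta given c.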
This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples while reading the paper.
All utterances were unseen during training, and the samples are the first 5 utterances of speaker p225 in the VCTK corpus.
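To make the task setup concrete, the following minimal Python sketch shows one way an LR signal that "contains a fraction of the original waveform samples" could be derived from an HR waveform. It is an illustration only: the function name make_low_res, the 16 kHz sample rate, the 440 Hz test tone, and the factor of 4 are all hypothetical, and the sketch uses naive decimation with no anti-aliasing filter, which may differ from the preprocessing used in the paper.

```python
import math


def make_low_res(waveform, factor):
    """Keep every `factor`-th sample of `waveform`.

    Naive decimation (assumption): no low-pass filter is applied first,
    so frequencies above the new Nyquist limit will alias.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return waveform[::factor]


# Hypothetical example: 10 ms of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
hr = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]

# Keeping one sample in four yields a signal at an effective rate of 4 kHz.
lr = make_low_res(hr, 4)
```

An audio super-resolution model is then asked to recover something close to hr when given only lr.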