WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution

Abstract: Audio super-resolution is the task of constructing high-resolution (HR) audio from low-resolution (LR) audio that contains only a fraction of the original waveform samples. Previous methods based on convolutional neural networks trained with a mean squared error objective perform relatively poorly, while adversarial generative models are difficult to train and tune. Recently, normalizing flows have attracted considerable attention for their high performance, simple training, and fast inference. In this paper, we propose WSRGlow, a Glow-based waveform generative model for audio super-resolution. Specifically, 1) we integrate WaveNet and Glow to directly maximize the exact likelihood of the target HR audio conditioned on LR information; and 2) to exploit the information in the LR audio, we propose an LR audio encoder and an STFT encoder, which encode the LR information in the time domain and the frequency domain, respectively. The experimental results show that the proposed model is easier to train and outperforms previous work in terms of both objective and perceptual quality. Audio samples are available at https://zkx06111.github.io/wsrglow/.
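
To make the LR/HR relationship concrete, the following is a minimal sketch of how an LR signal that "contains a fraction of the original waveform samples" could be derived from a 48 kHz recording. The helper `make_low_res` is a hypothetical name used only for illustration, and plain decimation (keeping every r-th sample) is an assumption: the preprocessing actually used for these samples is not described on this page, and practical pipelines usually apply an anti-aliasing low-pass filter before subsampling.

```python
import numpy as np


def make_low_res(hr_audio: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th sample of a 1-D waveform.

    Illustrative assumption: naive decimation with no anti-aliasing filter.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return hr_audio[::factor]


hr_rate = 48_000
# One second of a 440 Hz tone stands in for a real HR utterance.
hr = np.sin(2 * np.pi * 440.0 * np.arange(hr_rate) / hr_rate)

lr_2x = make_low_res(hr, 2)  # 24 kHz input for the 2x setting
lr_4x = make_low_res(hr, 4)  # 12 kHz input for the 4x setting
print(len(hr), len(lr_2x), len(lr_4x))  # 48000 24000 12000
```

With factors 2 and 4, one second of 48 kHz audio yields the 24 kHz and 12 kHz inputs used in the 2x and 4x comparisons on this page.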

This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples while reading the paper.
All utterances were unseen during training; the samples are the first 5 utterances of speaker p225 in the VCTK corpus.

Audio Samples for "WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution"

Section I: Examples for 2x Super-Resolution (upsampled from 24 kHz to 48 kHz)

Original low resolution (24 kHz)	Original high resolution (48 kHz)	U-Net (48 kHz)	MU-GAN (48 kHz)	WSRGlow (48 kHz)

Section II: Examples for 4x Super-Resolution (upsampled from 12 kHz to 48 kHz)

Original low resolution (12 kHz)	Original high resolution (48 kHz)	U-Net (48 kHz)	MU-GAN (48 kHz)	WSRGlow (50000 iters) (48 kHz)	WSRGlow (100000 iters) (48 kHz)

Section III: Examples for 4x Super-Resolution using WSRGlow with different inference temperatures

Original low resolution (12 kHz)	Original high resolution (48 kHz)	WSRGlow, T=0.5 (48 kHz)	WSRGlow, T=0.8 (48 kHz)	WSRGlow, T=1 (48 kHz)
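
For readers unfamiliar with inference temperature, the sketch below shows what the settings T = 0.5, 0.8 and 1 compared in Section III usually mean in Glow-style models: the latent vector is drawn from a Gaussian whose standard deviation is scaled by T before it is passed through the inverse flow. This page does not spell out how WSRGlow applies the temperature, so the convention is an assumption here, and `sample_latent` is a hypothetical helper rather than part of any released code.

```python
import numpy as np


def sample_latent(shape, temperature, seed=0):
    """Draw z ~ N(0, temperature^2 * I).

    Assumption: temperature scales the standard deviation of the Gaussian
    prior, as is common in Glow-style flows.
    """
    rng = np.random.default_rng(seed)
    return temperature * rng.standard_normal(shape)


# Compare the spread of the latent for the three temperatures in Section III.
latent_std = {}
for t in (0.5, 0.8, 1.0):
    z = sample_latent((48_000,), t)
    latent_std[t] = float(z.std())
    print(f"T={t}: empirical std of z is about {latent_std[t]:.2f}")
```

Lower temperatures keep the latent closer to the mode of the prior, which in flow-based generators typically trades sample diversity for smoother output.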