Audio Samples for "WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution"

Abstract: Audio super-resolution is the task of constructing high-resolution (HR) audio from low-resolution (LR) audio that contains a fraction of the original waveform samples. Previous methods based on convolutional neural networks and a mean squared error training objective have relatively low performance, while adversarial generative models are difficult to train and tune. Recently, normalizing flows have attracted a lot of attention for their high performance, simple training and fast inference. In this paper, we propose WSRGlow, a Glow-based waveform generative model to perform audio super-resolution. Specifically, 1) we integrate WaveNet and Glow to directly maximize the exact likelihood of the target HR audio conditioned on LR information; and 2) to exploit the audio information from low-resolution audio, we propose an LR audio encoder and an STFT encoder, which encode the LR information from the time domain and frequency domain respectively. The experimental results show that the proposed model is easier to train and outperforms previous works in terms of both objective and perceptual quality. Audio samples are available at https://zkx06111.github.io/wsrglow/.



This page contains a set of audio samples to support the paper; we suggest that the reader listen to the samples while reading the paper.
All utterances were unseen during training, and the results are the first 5 utterances of the speaker p225 in the VCTK corpus.

Section I: Examples for 2x Super-Resolution (upsampled from 24kHz to 48kHz)

This section contains examples for the speaker "p225" from the VCTK dataset. The upsampling rate is 2 (from 24kHz to 48kHz).
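For readers who want a concrete picture of what "a fraction of the original waveform samples" means, the following is a minimal sketch that keeps every r-th sample of a 48 kHz signal. The function name `subsample` and the synthetic sine tone are hypothetical stand-ins. This page does not state how the LR inputs were actually produced, and a low-pass (anti-aliasing) filter is commonly applied before subsampling; that step is omitted here.

```python
import math

def subsample(waveform, rate):
    """Keep every `rate`-th sample (assumption: no anti-aliasing filter)."""
    return waveform[::rate]

hr_sample_rate = 48000
# One second of a 440 Hz tone, used as a stand-in for an HR utterance.
hr = [math.sin(2 * math.pi * 440 * n / hr_sample_rate)
      for n in range(hr_sample_rate)]

lr_2x = subsample(hr, 2)  # 24 kHz, the input rate in Section I
lr_4x = subsample(hr, 4)  # 12 kHz, the input rate in Sections II and III

print(len(hr), len(lr_2x), len(lr_4x))  # prints: 48000 24000 12000
```

A super-resolution model is then asked to recover the 48 kHz waveform from `lr_2x` or `lr_4x`.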

Original low resolution (24 kHz) Original high resolution (48 kHz) U-Net (48 kHz) MU-GAN (48 kHz) WSRGlow (48 kHz)

Section II: Examples for 4x Super-Resolution (upsampled from 12kHz to 48kHz)

This section contains examples for the speaker "p225" from the VCTK dataset. The upsampling rate is 4 (from 12kHz to 48kHz). All models except WSRGlow (50000 iters) are trained with the same step size for 100000 iterations.

Original low resolution (12 kHz) Original high resolution (48 kHz) U-Net (48 kHz) MU-GAN (48 kHz) WSRGlow (50000 iters) (48 kHz) WSRGlow (100000 iters) (48 kHz)

Section III: Examples for 4x Super-Resolution using WSRGlow with different inference temperatures

This section contains examples for the speaker "p225" from the VCTK dataset. The upsampling rate is 4 (from 12kHz to 48kHz).
The same WSRGlow model is run at three different temperatures (T = 0.5, 0.8, and 1.0) during inference.
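As a rough illustration of what the temperature controls, the sketch below assumes the usual convention for Glow-style flows: the latent is drawn from a zero-mean Gaussian whose standard deviation is T, then passed through the inverse flow. The helper `sample_latent` is hypothetical and does not reproduce the WSRGlow code; the flow itself is not shown.

```python
import random
import statistics

def sample_latent(num_values, temperature, seed=0):
    """Draw latent values z ~ N(0, temperature^2) (assumed convention)."""
    rng = random.Random(seed)
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(num_values)]

for temperature in (0.5, 0.8, 1.0):
    z = sample_latent(48000, temperature)
    # The empirical standard deviation should be close to the temperature.
    print(temperature, round(statistics.pstdev(z), 2))
```

Under this assumption, a lower T keeps the latent closer to the mode of the prior, which typically trades some high-frequency detail for fewer artifacts. The samples in this section let the listener judge that trade-off directly.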

Original low resolution (12 kHz) Original high resolution (48 kHz) WSRGlow, T=0.5 (48 kHz) WSRGlow, T=0.8 (48 kHz) WSRGlow, T=1 (48 kHz)