Speech Enhancement with Stochastic Temporal Convolutional Networks
Abstract
We consider the problem of speech modeling in speech enhancement. Recently, deep generative approaches based on variational autoencoders have been proposed to model speech spectrograms. However, these approaches capture either hierarchical or temporal dependencies between stochastic latent variables, but not both. In this paper, we propose a generative approach to speech enhancement based on a stochastic temporal convolutional network (STCN), which combines hierarchical and temporal dependencies of stochastic latent variables. We evaluate our method on real recordings from different noisy environments. The proposed speech enhancement method outperforms a previous non-sequential approach based on feed-forward fully-connected networks in terms of speech distortion, instrumental speech quality, and intelligibility. At the same time, the computational cost of the proposed generative speech model remains feasible, since the convolutional architecture, unlike a recurrent one, processes all time frames in parallel.
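For intuition, the sketch below shows one way such a model can be structured: a stack of dilated causal convolutions in which each layer emits a diagonal-Gaussian latent variable, so depth provides the hierarchy of stochastic variables and dilation grows the temporal receptive field. This is a minimal, hypothetical PyTorch sketch under those assumptions, not the authors' implementation; all module and parameter names (CausalConv1d, STCN, hidden, z_dim, and so on) are illustrative.

```python
# Illustrative sketch only -- not the paper's code. Assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal 1-D convolution: output at frame t sees only frames <= t."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # left-pad only => causal

class STCN(nn.Module):
    """Stack of dilated causal conv layers; each layer emits the parameters of
    a diagonal-Gaussian latent, giving a hierarchy of stochastic variables
    whose temporal receptive field grows with depth."""
    def __init__(self, in_dim, hidden=64, z_dim=16, n_layers=5, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_dim
        for l in range(n_layers):
            self.layers.append(CausalConv1d(ch, hidden, kernel_size, dilation=2 ** l))
            ch = hidden
        # One latent head per layer (the hierarchical dependency).
        self.heads = nn.ModuleList(nn.Conv1d(hidden, 2 * z_dim, 1) for _ in range(n_layers))

    def forward(self, x):
        zs = []
        h = x
        for conv, head in zip(self.layers, self.heads):
            h = torch.relu(conv(h))
            mu, logvar = head(h).chunk(2, dim=1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
            zs.append((z, mu, logvar))
        return zs  # latents from all layers would feed a decoder (not shown)

# Usage: log-power spectrogram frames as channels over time.
x = torch.randn(8, 257, 100)              # (batch, frequency bins, frames)
latents = STCN(in_dim=257)(x)
print(len(latents), latents[0][0].shape)  # 5 layers, each latent (8, 16, 100)
```

Because every layer is a convolution applied across the whole sequence, all time frames are computed in one pass, which is the parallelism the abstract refers to; the left-only padding is what keeps each convolution causal.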
Audio Examples
Example 1
Mixture signal
Clean speech
VAE reconstruction
STCN reconstruction
Example 2
Mixture signal
Clean speech
VAE reconstruction
STCN reconstruction
Example 3
Mixture signal
Clean speech
VAE reconstruction
STCN reconstruction
Example 4
Mixture signal
Clean speech
VAE reconstruction
STCN reconstruction
Code
The code is available on GitHub.