Diffusion Models for Speech Enhancement
Diffusion models have shown a remarkable ability to bridge the performance gap between predictive and generative approaches to speech enhancement. Here, we present our research on diffusion models for speech enhancement.
Introduction
In the past decade, speech enhancement algorithms have benefited greatly from the introduction of data-driven approaches based on deep neural networks (DNNs). Among them, a broad class of methods utilizes predictive models that learn to map a given input to a desired output. In a typical supervised setting, a predictive model is trained on a labeled dataset to minimize a point-wise loss function between the processed input and the clean target. Following the principle of empirical risk minimization, the goal of predictive modelling is to find a model with minimal average error over the training data, where the generalization ability of the model is usually assessed on a validation set of unseen data. By employing ever-larger models and datasets (a current trend in deep learning), strong generalization can be achieved. However, many purely data-driven approaches are considered black boxes and remain largely unexplainable and non-interpretable. Moreover, these models typically produce deterministic outputs, disregarding the inherent uncertainty in their results.

Generative models follow a different learning paradigm, namely estimating and sampling from an unknown data distribution. This can be used to infer a measure of uncertainty for their predictions and to generate multiple valid estimates instead of a single best estimate as in predictive approaches. Furthermore, incorporating prior knowledge into generative models can guide the learning process and enforce desired properties of the learned distribution. In particular, diffusion models have emerged as a distinct class of deep generative models that boast an impressive ability to learn complex data distributions such as those of natural images, music, and speech. Diffusion models generate data samples through iterative transformations, transitioning from, e.g., a Gaussian prior distribution to a target data distribution, as visualized in Figure 1. This iterative generation scheme is formalized as a stochastic process and is parameterized with a DNN trained to address a Gaussian denoising task.
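To make this scheme concrete, the following is a minimal NumPy sketch of a discrete-time diffusion model in the style of DDPM. It illustrates the general principle only, not our implementation: the schedule values are illustrative assumptions, and `eps_model` is a hypothetical stand-in for the trained DNN.

```python
import numpy as np

T = 1000                            # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 2e-2, T)  # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I).
    Training amounts to Gaussian denoising: the DNN is fit to predict
    eps from (x_t, t), e.g. with an L2 loss."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

def generate(eps_model, shape, rng):
    """Ancestral sampling: start from the Gaussian prior and iteratively
    transform it toward the data distribution. `eps_model(x, t)` is a
    hypothetical stand-in for the trained denoising DNN."""
    x = rng.standard_normal(shape)  # sample from the Gaussian prior
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        # Posterior mean of the reverse transition, given the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

With a trained `eps_model`, the loop in `generate` is exactly the iterative transformation from the prior to the data distribution described above.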
Conditional Generation
One of the most fundamental uses of diffusion models is to perform unsupervised learning from a finite collection of samples to capture an underlying complex data distribution. This provides the ability of unconditional generation, i.e., to generate new samples from the learned data distribution. To solve audio restoration tasks, a diffusion model must be adapted to generate audio that not only conforms to the learned clean audio distribution but, importantly, is also a plausible reconstruction of a given corrupted signal, which can be interpreted as a specific type of conditional generation.

To this end, we use task-adapted diffusion, which is based on the observation that, in many restoration tasks such as denoising, dereverberation, and separation, the corrupted signal \({\mathbf{y}}\) and the clean signal \({\mathbf{x}_0}\) belong to a common continuous space. One can thus define an adapted forward diffusion process whose mean interpolates between \({\mathbf{x}_0}\) and \({\mathbf{y}}\), as shown in Figure 2, where one classical diffusion process and two such task-adapted processes are depicted.

This forward process begins exactly at \({\mathbf{x}_0}\) and ends approximately at \({\mathbf{y}}\) plus added Gaussian noise. In turn, the associated reverse process terminates at \({\mathbf{x}_0}\). To perform audio restoration, the reverse process is initialized by adding Gaussian noise to \({\mathbf{y}}\) and solved with the usual diffusion model sampling schemes. In practice, we combine the task-adapted diffusion process with direct conditioning by providing \({\mathbf{y}}\) as an auxiliary input to the score model.
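As a concrete illustration, here is a minimal NumPy sketch of such a task-adapted process with an Ornstein-Uhlenbeck-type drift that pulls the mean from \({\mathbf{x}_0}\) toward \({\mathbf{y}}\). The constants, the geometric noise schedule, the Euler-Maruyama solver, and the `score_model(x, y, t)` signature are all illustrative assumptions, not a reproduction of our exact formulation.

```python
import numpy as np

GAMMA = 1.5                       # assumed stiffness: pull from x0 toward y
SIGMA_MIN, SIGMA_MAX = 0.05, 0.5  # assumed noise schedule endpoints

def sigma(t):
    """Assumed geometric noise schedule (variance-exploding style)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def g(t):
    """Diffusion coefficient matching sigma(t) in the pure VE case."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))

def forward_mean(x0, y, t):
    """Mean of the task-adapted forward process under the drift
    GAMMA * (y - x): an exponential interpolation that starts exactly
    at x0 (t = 0) and approaches y as t grows."""
    a = np.exp(-GAMMA * t)
    return a * x0 + (1.0 - a) * y

def restore(score_model, y, n_steps=30, rng=None):
    """Euler-Maruyama solver for the reverse SDE, run from t = 1 to 0.
    Initialization: the corrupted signal y plus Gaussian noise. The score
    model receives y as an auxiliary input (direct conditioning)."""
    rng = rng or np.random.default_rng(0)
    dt = 1.0 / n_steps
    x = y + sigma(1.0) * rng.standard_normal(y.shape)  # start near y
    for i in range(n_steps, 0, -1):
        t = i * dt
        # Reverse-time drift: forward drift minus g(t)^2 times the score
        drift = GAMMA * (y - x) - g(t) ** 2 * score_model(x, y, t)
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(y.shape)
    return x
```

A single Euler-Maruyama pass is the simplest solver choice here; predictor-corrector samplers are a common refinement in score-based diffusion models and can be substituted without changing the process itself.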