Combining Statistical Signal Processing and Machine Learning
Introduction
Speech is commonly used for communication among human beings and is also employed to interact with personal computers. With increasing levels of background noise, understanding and recognizing speech becomes more and more difficult. Moreover, background noise degrades the perceived quality of speech. To improve the quality and possibly also the intelligibility of noise-corrupted speech, speech enhancement algorithms are employed. A typical approach to enhance noisy speech signals using only a single observation is to use a time-frequency representation, e.g., the short-time Fourier transform (STFT). Here, the noise-corrupted speech signal is enhanced by suppressing the time-frequency points that mainly contain noise (a minimal sketch of this pipeline follows the list below). We distinguish between two broad categories of enhancement algorithms:
- Conventional non-machine-learning (non-ML) based approaches: These approaches estimate the statistical parameters of the speech and the background noise that are required for a blind enhancement of the noisy observation.
- Machine-learning (ML) based approaches: The statistics of speech and noise required for the enhancement are learned from training examples and are then used in the enhancement step.
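As a concrete illustration of the STFT-based masking pipeline described above, the following minimal Python sketch suppresses time-frequency points dominated by noise using a Wiener-style gain. The function name, the STFT parameters, and the crude a priori SNR estimate are illustrative assumptions, not details taken from [1]–[6].

```python
# Minimal sketch of STFT-domain enhancement by time-frequency gain (assumed setup).
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, noise_psd, fs=16000, nperseg=512):
    """Attenuate time-frequency points that mainly contain noise.

    noisy:     time-domain signal, shape (samples,)
    noise_psd: noise PSD estimate per frequency bin, shape (nperseg // 2 + 1,)
    """
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)          # analysis
    gamma = np.abs(Y) ** 2 / noise_psd[:, None]            # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.01)                     # crude a priori SNR estimate
    gain = xi / (1.0 + xi)                                 # Wiener gain in [0, 1]
    _, enhanced = istft(gain * Y, fs=fs, nperseg=nperseg)  # synthesis
    return enhanced
```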
Our recent research focuses on the robustness of single-channel speech enhancement algorithms. For this, synergies between ML-based and non-ML-based approaches are exploited. Here, we summarize our results. The work can be divided into two broad topics:
- Research on the generalization of deep neural network (DNN) based speech enhancement algorithms.
- Improving the performance of ML-based spectral envelope speech enhancement approaches, i.e., ML-based approaches that only model the spectral envelope of speech.
In [1]–[3], we consider the generalization capabilities of speech enhancement approaches that are based on DNNs. We analyze how various input features and training data affect the generalization of DNN-based speech enhancement approaches to unseen noise conditions. One approach to improve this generalization is dynamic noise-aware training (NAT), where an estimate of the noise power spectral density (PSD) is appended to the periodogram input features. We propose a novel type of input feature where the noise PSD is used for normalization instead of being appended. The normalization results in features that are based on signal-to-noise ratios (SNRs), and we show that these features allow DNN-based approaches to generalize to unseen noise conditions better than NAT, especially if only small training data sets are available. This is shown by experimental evaluations using instrumental measures such as the Perceptual Evaluation of Speech Quality (PESQ), but also by inspecting the input features and the internal representations using t-distributed stochastic neighbor embedding (t-SNE).
In [4]–[6], we investigate ML-based speech enhancement algorithms where speech is modeled only by its spectral envelope. The fundamental tone and its harmonics, which are caused by the vibrating vocal cords, are not explicitly modeled. We refer to these methods as machine-learning spectral envelope (MLSE) based methods. Even though such models have advantages in terms of generalization and computational complexity, the noise between the spectral harmonics is not suppressed if Gaussian speech enhancement filters are used. We show that super-Gaussian enhancement filters can be used to reduce this undesired effect [4], [5]. This improves the quality of the enhanced signal, as verified by both instrumental measures and listening tests. Furthermore, a combination of non-ML-based and ML-based approaches has been investigated in [6].
SNR-Based Features for Robust DNN-Based Speech Enhancement
To improve the robustness of DNN-based speech enhancement algorithms, we propose in [1]–[3] to employ SNR-based features which are directly related to the a priori SNR and the a posteriori SNR. These features are motivated by non-ML-based clean speech estimators which are analytically derived from statistical models. We refer to these features as SNR-NAT: in contrast to the existing NAT, where an estimate of the noise PSD is appended to the input vector of noisy features, the noise PSD is used for normalization. In our work, the speech and noise PSDs are estimated using conventional algorithms, as these approaches are robust to unseen noise conditions. Therefore, both feature types can also be seen as a combination of conventional and ML-based speech enhancement approaches.
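As a rough sketch of how such SNR-based features could be computed, the following snippet normalizes the noisy periodogram by a noise PSD estimate to obtain the a posteriori SNR and derives a decision-directed a priori SNR estimate from it. The smoothing factor, the log compression, and the way the two quantities are combined are assumptions for illustration; the exact feature definition is given in [1]–[3].

```python
# Sketch of SNR-based (SNR-NAT-style) features: normalize by the noise PSD
# instead of appending it, as NAT does. Details are illustrative assumptions.
import numpy as np

def snr_features(noisy_periodogram, noise_psd, alpha=0.98):
    """noisy_periodogram, noise_psd: arrays of shape (frames, bins)."""
    gamma = noisy_periodogram / np.maximum(noise_psd, 1e-12)  # a posteriori SNR
    xi = np.empty_like(gamma)
    prev_speech_psd = np.zeros(gamma.shape[1])                # |S|^2 of previous frame
    for l in range(gamma.shape[0]):
        # decision-directed a priori SNR estimate
        xi[l] = (alpha * prev_speech_psd / np.maximum(noise_psd[l], 1e-12)
                 + (1.0 - alpha) * np.maximum(gamma[l] - 1.0, 0.0))
        gain = xi[l] / (1.0 + xi[l])                          # Wiener gain
        prev_speech_psd = gain ** 2 * noisy_periodogram[l]    # clean speech estimate
    # stack both SNR quantities in the log domain as input features
    return np.log10(np.maximum(np.concatenate([gamma, xi], axis=1), 1e-12))
```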
In [2], the performance of SNR-NAT and NAT features is compared using a cross-validation procedure. The training data is generated from a pool of nine different noise types, which is used to artificially corrupt clean speech signals. To make the DNN learn to deal with different acoustic conditions, each speech signal is corrupted at a different SNR and the overall level is also varied. This data is used to train nine different models, where for each model one of the noise types is excluded during training. Additionally, speech material corrupted by randomly concatenated non-speech sounds is added to the training data of all models to increase the base robustness. All models predict a gain function from the respective features using a feed-forward network with three hidden layers, each having 1024 rectified linear units (ReLUs). The predicted gain function lies in the range [0, 1]; hence, the non-linearity of the output layer is the sigmoid function.
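A minimal PyTorch sketch of such a network is given below: three hidden layers with 1024 ReLUs each and a sigmoid output so that the predicted gain lies in [0, 1]. The input and output dimensions are placeholders, since the exact feature dimensionality is defined in [2].

```python
# Feed-forward mask estimator: 3 hidden layers x 1024 ReLUs, sigmoid output.
import torch
import torch.nn as nn

n_in, n_bins = 514, 257   # placeholder dimensions (e.g., stacked SNR features in)

model = nn.Sequential(
    nn.Linear(n_in, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_bins), nn.Sigmoid(),  # gain function in [0, 1]
)

features = torch.randn(8, n_in)             # dummy batch of input features
gain = model(features)                      # predicted gains, shape (8, n_bins)
```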

In [2], the models are evaluated using instrumental measures, e.g., PESQ, on the noise types also used for training. However, we always use the model that has not been trained on the noise type used for corrupting the tested speech signal, i.e., the noise types are always unseen during the evaluation. Figure 1 shows the results: On the left side, the peak level of the speech signal is varied, which is directly related to the overall level of the input signal, while the SNR is fixed to 5 dB. On the right side, the SNR is varied from -5 dB to 20 dB in 5 dB steps, while the peak level of the speech signal is varied in the range also used for training. The experimental evaluations show two advantages of the SNR-NAT features: First, the SNR-NAT features make the DNN-based enhancement scheme scale-invariant, i.e., the results do not depend on the overall level of the input signal. Second, the SNR-NAT features are more robust in unseen noise conditions than the NAT features.

Further, the effectiveness of the proposed features has been validated using subjective listening experiments in [3]. Using a multiple stimuli with hidden reference and anchor (MUSHRA) test, the DNN using only the noisy input features and the DNN using the proposed normalized features are compared to a non-ML approach, the noisy signal, and an anchor signal. The algorithms have been evaluated in factory and traffic noise. Figure 2 shows the results of the MUSHRA test, where the linking lines indicate differences that are not statistically significant. The statistical significance has been tested using a repeated-measures analysis of variance and post-hoc tests. The results show that the DNN-based speech enhancement approach using the proposed normalized features is rated significantly better in unseen noise conditions. Audio examples from the listening experiment can be found here.
A more in-depth analysis of the NAT and SNR-NAT features is provided in [1], where the input features and the internal states are analyzed using t-SNE. This method embeds data vectors from a high-dimensional space into a low-dimensional space. We applied it to the NAT and SNR-NAT features extracted from four sentences spoken by two male and two female speakers. These sentences are corrupted by seven different noise types, where the input SNR is fixed to 5 dB and the peak level of the signal is set to -6 dB.

Figure 3 shows the t-SNE embeddings for the raw input features, i.e., for the NAT and the SNR-NAT features. For the NAT features, the t-SNE embeddings cluster depending on the noise type, and the data points can easily be separated. From this, we conclude that the NAT features are highly dependent on the noise type. In contrast, the embeddings of the SNR-NAT features do not show any clusters, from which we conclude that these features are less dependent on the noise type. Further analyses in [1] show that similar observations can be made for the embeddings of the internal states of the DNN and also for the output signal.
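The following sketch shows how such an embedding can be computed with scikit-learn; the feature matrix and the noise-type labels are random placeholders, whereas [1] uses the actual NAT and SNR-NAT features and DNN states.

```python
# t-SNE: embed high-dimensional feature vectors into 2-D and color by noise type.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.standard_normal((700, 257))   # placeholder feature vectors
noise_type = rng.integers(0, 7, size=700)    # one of seven noise types (placeholder)

embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=noise_type, cmap="tab10", s=5)
plt.title("t-SNE embedding colored by noise type")
plt.show()
```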
Super-Gaussian Estimators for MLSE-Based Speech Enhancement
MLSE-based speech enhancement approaches model speech only using its spectral envelope, but not its spectral fine structure. Such models have advantages in terms of generalization and computational complexity over more sophisticated models. The quality of the enhancement, however, suffers from low noise suppression during speech activity.
In our work, we use two examples of MLSE-based approaches: The first one uses a DNN-based phoneme recognizer for estimating the speech PSD, while the second one is based on nonnegative matrix factorization (NMF). The DNN-based phoneme recognizer recognizes the spoken phoneme from the noisy observation, which is then used to select pre-trained speech PSDs in the enhancement process. The NMF-based approach uses only a small number of basis vectors such that only the spectral envelope is modeled.
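The following scikit-learn sketch illustrates the envelope-only property of a low-rank NMF: with only a few basis vectors, the reconstruction is too smooth to represent the harmonic fine structure. The spectrogram and the rank are placeholders; the actual model configuration is described in [4]–[6].

```python
# Low-rank NMF of a magnitude spectrogram: few basis vectors -> envelopes only.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
magnitude = np.abs(rng.standard_normal((400, 257)))   # (frames, bins), placeholder

nmf = NMF(n_components=8, init="nndsvda", max_iter=400)
activations = nmf.fit_transform(magnitude)            # (frames, components)
basis = nmf.components_                               # (components, bins): envelope-like atoms
envelope_model = activations @ basis                  # smooth low-rank approximation
```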

Figure 4 shows the estimated a priori SNR of a non-MLSE-based enhancement approach and of two MLSE-based approaches. The non-MLSE approach is able to estimate the spectral fine structure of the input signal, while the MLSE-based approaches cannot.

In [4], [5], we show that super-Gaussian estimators make it possible to suppress the background noise between the spectral harmonics, which cannot be achieved with Gaussian estimators. Figure 5 shows the spectral gain functions of a Gaussian clean speech estimator in the upper row and a super-Gaussian clean speech estimator in the lower row. Although MLSE-based enhancement approaches are unable to estimate the fine structure of the speech PSD, it is possible to recover the fine structure using super-Gaussian clean speech estimators: their gain depends more strongly on the a posteriori SNR, which still carries the harmonic structure of the noisy observation.
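The mechanism can be illustrated with the following sketch. It contrasts the Wiener gain, which depends only on the a priori SNR, with the Gaussian MMSE short-time spectral amplitude (STSA) estimator of Ephraim and Malah, which additionally depends on the a posteriori SNR; the super-Gaussian estimators analyzed in [4], [5] strengthen this dependence further. The STSA estimator serves here only as a simple stand-in, not as the estimator used in [4], [5].

```python
# Gain functions vs. a posteriori SNR for a fixed, envelope-based a priori SNR.
import numpy as np
from scipy.special import ive   # exponentially scaled modified Bessel functions

def wiener_gain(xi, gamma):
    """Depends only on the a priori SNR xi; constant over gamma."""
    return np.full_like(np.asarray(gamma, dtype=float), xi / (1.0 + xi))

def mmse_stsa_gain(xi, gamma):
    """Gaussian MMSE-STSA gain (Ephraim/Malah); also depends on gamma."""
    v = xi / (1.0 + xi) * gamma
    # exp(-v/2) * I_n(v/2) is computed stably as ive(n, v/2)
    return (np.sqrt(np.pi * v) / (2.0 * gamma)
            * ((1.0 + v) * ive(0, v / 2.0) + v * ive(1, v / 2.0)))

xi = 1.0                           # smooth MLSE-style a priori SNR (0 dB)
gamma = np.logspace(-1, 2, 7)      # sweep of the a posteriori SNR (fine structure)
print(wiener_gain(xi, gamma))      # constant: cannot re-create fine structure
print(mmse_stsa_gain(xi, gamma))   # varies with gamma: can track fine structure
```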

The beneficial effect of super-Gaussian clean speech estimators for MLSE-based speech enhancement approaches has been validated using listening experiments. Figure 6 shows the results of a MUSHRA test. The upper and lower edges of each box show the upper and lower quartiles, while the bar within the box is the median. The upper whisker reaches to the largest data point that is smaller than the upper quartile plus 1.5 times the interquartile range; the lower whisker is defined analogously. For the non-MLSE-based enhancement scheme, there is little difference between the Gaussian and the super-Gaussian estimator. For the considered MLSE-based algorithms, the difference in perceived quality between Gaussian and super-Gaussian estimators is considerable. It has been found statistically significant using a Wilcoxon signed-rank test at a significance level of 5%.
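As a sketch of the significance test mentioned above, the following snippet runs a Wilcoxon signed-rank test on paired ratings; the rating values are made-up placeholders, not data from the experiment.

```python
# Paired Wilcoxon signed-rank test on (placeholder) MUSHRA ratings.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
ratings_gauss = rng.uniform(40, 70, size=15)                   # per-listener scores
ratings_super_gauss = ratings_gauss + rng.uniform(0, 20, 15)   # assumed improvement

statistic, p_value = wilcoxon(ratings_gauss, ratings_super_gauss)
print(f"p = {p_value:.4f}; significant at the 5% level: {p_value < 0.05}")
```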
The following audio examples are taken from the listening experiment. Here, speech from a female and a male speaker has been corrupted by traffic noise and babble noise at 5 dB SNR. For the examples, a Gaussian and a super-Gaussian short-term spectral amplitude estimator are employed, for which the attenuation is limited to a maximum of 12 dB.
[Audio examples: female and male speech in traffic noise and babble noise; for each condition, the noisy signal and the signals enhanced with the Gaussian and the super-Gaussian estimator.]
In [4], we mainly considered clean speech estimators derived using an additive mixing model. In [5], we show that similar effects can also be observed for estimators based on the MixMax model. In comparison to estimators based on additive models, this approach leads to fewer musical tone artifacts, as can be verified by the following audio examples. As above, traffic noise and babble noise are used. The filters are a super-Gaussian log-spectral amplitude estimator based on an additive model and a super-Gaussian log-spectral amplitude estimator based on the MixMax model. The maximum attenuation is set to 15 dB.
[Audio examples: female and male speech in traffic noise and babble noise; for each condition, the noisy signal and the signals enhanced with the additive-model and the MixMax-model estimator.]
Further audio examples of the approaches presented in [4], [5] can be found here and here.
Related Publications
[1] R. Rehr and T. Gerkmann, “An analysis of noise-aware features in combination with the size and diversity of training data for DNN-based speech enhancement,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, accepted for publication.
[2] R. Rehr and T. Gerkmann, “Robust DNN-based speech enhancement with limited training data,” in ITG Conference on Speech Communication, Oldenburg, Germany, 2018.
[3] R. Rehr and T. Gerkmann, “Normalized features for improving the generalization of DNN based speech enhancement,” arXiv:1709.02175 [cs], Sep. 2017. [Online]. Available: http://arxiv.org/abs/1709.02175
[4] R. Rehr and T. Gerkmann, “On the importance of super-Gaussian speech priors for machine-learning based speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 357–366, Feb. 2018.
[5] R. Rehr and T. Gerkmann, “MixMax approximation as a super-Gaussian log-spectral amplitude estimator for speech enhancement,” in Interspeech, Stockholm, Sweden, 2017.
[6] R. Rehr and T. Gerkmann, “A combination of pre-trained approaches and generic methods for an improved speech enhancement,” in ITG Conference on Speech Communication, Paderborn, Germany, 2016, pp. 51–55.