Nonlinear Spatial Filtering
Speech is an important cornerstone of human communication. It may, however, be corrupted by interfering background noise, which affects the perceived quality and intelligibility. For this reason, speech enhancement techniques that reduce noise and other disturbing effects, e.g., reverberation, play a central role in a variety of communication applications. These include traditional applications such as telephony and hearing aids but also newly emerging ones like speech-controlled human-computer interaction through an automatic speech recognition (ASR) system.
While single-channel speech enhancement approaches (traditional statistics-based or based on a deep neural network (DNN)) make use of tempo-spectral signal characteristics to perform the enhancement, multichannel approaches can additionally leverage spatial information contained in the noisy recordings obtained using multiple microphones. Today, the majority of new devices are equipped with multiple microphones, which emphasizes the practical relevance of spatial information processing. Traditionally, spatial filtering is achieved by so-called beamformers that aim at suppressing signal components arriving from directions other than the target direction. The most prominent example is the filter-and-sum beamforming approach, which achieves spatial selectivity by filtering the individual microphone signals and adding them. This results in a linear operation with respect to the noisy input. The filter weights are designed to optimize some performance measure. For example, minimizing the noise variance subject to a distortionless constraint leads to the well-known minimum variance distortionless response (MVDR) beamformer.
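As a minimal sketch of the MVDR design mentioned above, the following NumPy snippet computes the standard closed-form weights w = Φ⁻¹d / (dᴴΦ⁻¹d) for a single frequency bin. The function name `mvdr_weights`, the uniform linear array with half-wavelength spacing, and the random noise covariance matrix are assumptions made purely for illustration and are not taken from our publications.

```python
import numpy as np


def mvdr_weights(noise_cov, steering):
    """Closed-form MVDR weights for one frequency bin.

    noise_cov: (M, M) Hermitian positive-definite noise covariance matrix.
    steering:  (M,) steering vector towards the target direction.
    """
    # Solve Phi x = d instead of forming the explicit inverse.
    phi_inv_d = np.linalg.solve(noise_cov, steering)
    # Normalize so that the distortionless constraint w^H d = 1 holds.
    return phi_inv_d / (steering.conj() @ phi_inv_d)


# Hypothetical setup: 4 microphones, target at 20 degrees,
# uniform linear array with half-wavelength spacing.
rng = np.random.default_rng(0)
M = 4
d = np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(20.0)))

# Random Hermitian positive-definite matrix standing in for the noise covariance.
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi = A @ A.conj().T + 1e-3 * np.eye(M)

w = mvdr_weights(Phi, d)
response = w.conj() @ d                      # response towards the target
noise_power_mvdr = np.real(w.conj() @ Phi @ w)

# Delay-and-sum weights also satisfy the constraint but ignore the noise statistics.
w_ds = d / M
noise_power_ds = np.real(w_ds.conj() @ Phi @ w_ds)

print("target response:", response)
print("noise power MVDR / delay-and-sum:", noise_power_mvdr, noise_power_ds)
```

Both filters pass the target undistorted, but the MVDR weights yield the lower residual noise power because they exploit the noise covariance.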
To this day, a widespread processing pipeline for multichannel speech enhancement first performs linear spatial filtering with a beamformer, whose output is then further processed by a single-channel algorithm called the postfilter. This approach, however, restricts the spatial filter to the employed linear processing model and makes it impossible to account for and take advantage of the interaction between spatial and spectral information. In contrast, nonlinear spatial filters, for example implemented by neural networks, can potentially overcome these limitations, leading to improved multichannel speech enhancement methods. Therefore, our research focuses on investigating the performance potential of nonlinear spatial filters implemented using modern machine learning techniques.
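The two-step pipeline described above can be sketched as follows. This is only an illustrative example under stated assumptions: the array layout `(F, T, M)` for the multichannel STFT, the function names `beamform` and `wiener_postfilter`, the use of a Wiener gain as the postfilter, and the random test data are all hypothetical choices and do not reproduce a specific system from our work.

```python
import numpy as np


def beamform(stft_noisy, weights):
    """Linear spatial filtering: y[f, t] = w[f]^H x[f, t].

    stft_noisy: (F, T, M) complex multichannel STFT.
    weights:    (F, M) complex beamformer weights per frequency bin.
    """
    return np.einsum("fm,ftm->ft", weights.conj(), stft_noisy)


def wiener_postfilter(beamformed, speech_psd, noise_psd, eps=1e-12):
    """Single-channel postfilter: real-valued Wiener gain per time-frequency bin."""
    gain = speech_psd / (speech_psd + noise_psd + eps)
    return gain * beamformed


# Hypothetical toy data: 3 frequency bins, 5 frames, 2 microphones.
rng = np.random.default_rng(1)
F, T, M = 3, 5, 2
x = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))

# Placeholder weights (simple channel averaging) and placeholder PSD estimates.
w = np.full((F, M), 1.0 / M, dtype=complex)
speech_psd = rng.uniform(0.1, 1.0, size=(F, T))
noise_psd = rng.uniform(0.1, 1.0, size=(F, T))

y = beamform(x, w)                              # step 1: spatial filter
s_hat = wiener_postfilter(y, speech_psd, noise_psd)  # step 2: postfilter
print(y.shape, s_hat.shape)
```

Note that the spatial step is linear in the noisy input and that the postfilter only sees the single-channel beamformer output, which is exactly the restriction a joint nonlinear filter avoids.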
Analysis of the performance benefit of non-linear spatial filters
While it is convenient to independently develop the spatial and spectral processing stages as shown on the left-hand side of Figure 1, theoretical analyses consolidated in our paper show that this is optimal in the minimum mean-squared error (MMSE) sense only under the restrictive assumption that the noise signal follows a Gaussian distribution. In our work [1-3], we demonstrate based on evaluations that a joint spatial and spectral non-linear filter as illustrated on the right-hand side of Figure 1 indeed enables considerable performance improvements over the composed setup in non-Gaussian noise scenarios, revealing the theoretical results to be highly relevant for practical applications.
Figure 2 depicts published experimental results for an inhomogeneous noise field created by five interfering noise sources. The noise and target signals are captured by a two-channel microphone array. As the number of interfering sources is larger than the number of microphones, the enhancement capabilities of traditional two-step approaches combining a linear spatial filter and a postfilter are limited. In the presented example, the interfering sources emit Gaussian bursts, with only one interfering source being active per time segment. The uniform green coloration of the vertical stripes in the noisy spectrogram reflects the spectral stationarity. The vertical dark blue lines separate segments with different spatial properties. We would like to emphasize the visible and audible difference between the result obtained by an MVDR beamformer combined with a postfilter (middle in the bottom row) and the analytical joint spatial-spectral non-linear filter (last in the bottom row). While the latter can recover the original clean signal almost perfectly, the MVDR plus postfilter suffers from audible speech degradation and residual noise. The audio examples can be found here. We conclude from the experiment that a non-linear spatial filter that combines the spatial and spectral processing stages is capable of eliminating more than the traditional M-1 sources from M microphones without spatial adaptation and, thus, has notably increased spatial selectivity.
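The classical M-1 limit of a linear spatial filter referred to above can be illustrated with a small numerical sketch. A linear filter with M weights can satisfy at most M linear constraints exactly, i.e., one distortionless constraint for the target plus M-1 nulls. The array geometry (two microphones with half-wavelength spacing), the source angles, and the helper names `steering` and `constrained_filter` are assumptions chosen for illustration only.

```python
import numpy as np


def steering(angle_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (illustrative model)."""
    n = np.arange(n_mics)
    phase = 2.0 * np.pi * spacing_wavelengths * n * np.sin(np.deg2rad(angle_deg))
    return np.exp(-1j * phase)


def constrained_filter(target, interferers):
    """Least-squares solution of w^H target = 1 and w^H interferer_k = 0.

    Returns the weights and the norm of the constraint violation.
    """
    C = np.stack([target] + list(interferers), axis=1)  # (M, K) constraint matrix
    g = np.zeros(C.shape[1], dtype=complex)
    g[0] = 1.0                                          # unit response for the target
    w, *_ = np.linalg.lstsq(C.conj().T, g, rcond=None)  # solve C^H w = g
    violation = np.linalg.norm(C.conj().T @ w - g)
    return w, violation


M = 2
d_target = steering(0.0, M)
a1 = steering(40.0, M)
a2 = steering(-60.0, M)

# One interferer: 2 constraints, 2 weights, exactly solvable.
_, violation_one = constrained_filter(d_target, [a1])
# Two interferers: 3 constraints, 2 weights, no exact solution in general.
_, violation_two = constrained_filter(d_target, [a1, a2])

print("constraint violation with 1 interferer:", violation_one)
print("constraint violation with 2 interferers:", violation_two)
```

With one interferer the constraints are met up to numerical precision, whereas with two interferers a fixed linear two-microphone filter cannot null both while keeping the target undistorted.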
In our work [1-3], we evaluated the performance of the non-linear spatial filter in different non-Gaussian noise scenarios. We find a notable performance improvement also in other experiments with heavy-tailed noise distributions, interfering human speakers, and real-world noise recordings from the CHiME3 database.
Do not miss the exciting audio examples for these experiments here.
Investigation of DNN-based joint non-linear filters
- Is non-linear as opposed to linear spatial filtering the main factor for good performance?
- Or is it rather the interdependency between spatial and tempo-spectral processing?
- And do temporal and spectral information have the same impact on spatial filtering performance?
- FT-JNF: joint spatial and tempo-spectral non-linear filter (all three sources of information)
- F-JNF: spatial-spectral non-linear filter
- T-JNF: spatial-temporal non-linear filter
- FT-NSF: non-linear spatial filter (fine-grained tempo-spectral information is excluded, global tempo-spectral information is accessible)
- F-NSF: non-linear spatial filter (fine-grained spectral information is excluded, global spectral information is accessible)
- T-NSF: non-linear spatial filter (fine-grained temporal information is excluded, global temporal information is accessible)
- LSF: oracle MVDR baseline
The results displayed in Figure 3 show that the non-linear spatial filter (FT-NSF) outperforms the oracle linear spatial filter (LSF, which is an oracle MVDR beamformer). It seems that the more powerful processing of spatial information leads to a clear performance advantage. However, the picture changes when post-filtering is included. While an independent tempo-spectral post-filter leads to good results when applied to the output of a linear and distortionless MVDR beamformer, the non-linear spatial filter benefits much less. We can explain this by the fact that the non-linear spatial filter introduces speech distortions that cannot be fixed by a mask-based post-filter. In contrast, if spatial and tempo-spectral information are processed jointly, the overall best performance is obtained. These results highlight the importance of interdependencies between spatial and tempo-spectral information.
We have also investigated the spatial selectivity of the different filters. A visualization of the spatial selectivity of the learnt filters is shown in Figure 4. We find that joint processing, in particular of spatial and spectral information, increases the spatial selectivity.
Please find audio examples here.
- Kristina Tesch, Timo Gerkmann, "Nonlinear Spatial Filtering in Multichannel Speech Enhancement", IEEE/ACM Trans. Audio, Speech, Language Proc., 2021. [doi] [arxiv] [audio] ITG VDE award 2022
- Kristina Tesch, Timo Gerkmann, "Nonlinear Spatial Filtering For Multichannel Speech Enhancement in Inhomogeneous Noise Fields", IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Barcelona, Spain, May 2020. [doi]
- Kristina Tesch, Robert Rehr, Timo Gerkmann, "On Nonlinear Spatial Filtering in Multichannel Speech Enhancement", ISCA Interspeech, Graz, Austria, Sep. 2019. [doi]
- Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann, "On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement", ISCA Interspeech, Incheon, Korea, Sep. 2022. [arxiv]
- Kristina Tesch, Timo Gerkmann, "Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement", IEEE/ACM Trans. Audio, Speech, Language Proc., 2023. [arxiv][audio]
- Kristina Tesch, Timo Gerkmann, "Spatially Selective Deep Non-linear Filters for Speaker Extraction", accepted to ICASSP 2023. [arxiv][audio]