Nonlinear Spatial Filtering

Introduction
Speech is a cornerstone of human communication. It may, however, be corrupted by interfering background noise, which degrades the perceived quality and intelligibility. For this reason, speech enhancement techniques that reduce noise and other disturbing effects, e.g., reverberation, play a central role in a variety of communication applications. These include traditional applications such as telephony and hearing aids, but also newly emerging ones like speech-controlled human-computer interaction through an automatic speech recognition (ASR) system.
While single-channel speech enhancement approaches (traditional statistics-based or based on a deep neural network (DNN)) make use of tempo-spectral signal characteristics to perform the enhancement, multichannel approaches can additionally leverage the spatial information contained in noisy recordings obtained with multiple microphones. Today, the majority of new devices are equipped with multiple microphones, which emphasizes the practical relevance of spatial information processing. Traditionally, spatial filtering is achieved by so-called beamformers, which aim at suppressing signal components arriving from directions other than the target direction. The most prominent example is the filter-and-sum beamforming approach, which achieves spatial selectivity by filtering the individual microphone signals and summing them. This results in a linear operation with respect to the noisy input. The filter weights are designed to optimize some performance measure. For example, minimizing the noise variance subject to a distortionless constraint leads to the well-known minimum variance distortionless response (MVDR) beamformer.
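To make the linear processing model concrete, the following NumPy sketch computes MVDR filter weights for a single time-frequency bin. The steering vector and noise covariance matrix are illustrative assumptions, not quantities from the paper; in practice both would be estimated from the microphone signals.

```python
import numpy as np

# Minimal sketch of a frequency-domain MVDR beamformer for one
# time-frequency bin. Steering vector and noise covariance are
# illustrative assumptions here; in practice they are estimated.
rng = np.random.default_rng(0)

M = 4                                  # number of microphones
d = np.exp(1j * np.pi * np.arange(M))  # assumed steering vector (target direction)

# Assumed spatial noise covariance: a random Hermitian PSD matrix
# with diagonal loading to keep it well conditioned
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_nn = A @ A.conj().T + M * np.eye(M)

# MVDR weights: w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)
Phi_inv_d = np.linalg.solve(Phi_nn, d)
w = Phi_inv_d / (d.conj() @ Phi_inv_d)

# The distortionless constraint means the target direction is
# passed with unit gain: |w^H d| = 1
print(np.abs(w.conj() @ d))
```

Applying the beamformer to a noisy multichannel observation `x` (shape `(M,)`) is then the linear operation `w.conj() @ x`, i.e., filter-and-sum in a single frequency bin.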
To this day, a widespread processing pipeline for multichannel speech enhancement first performs linear spatial filtering with a beamformer, whose output is then processed by a single-channel algorithm called the postfilter. This approach, however, restricts the spatial filter to the employed linear processing model and, furthermore, prevents accounting for, and taking advantage of, the interaction between spatial and spectral information. In contrast, nonlinear spatial filters, implemented for example by neural networks, can potentially overcome these limitations, leading to improved multichannel speech enhancement methods. Therefore, our research focuses on investigating the performance potential of nonlinear spatial filters implemented using modern machine learning techniques.
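The two-stage pipeline described above can be sketched in a few lines. This is a hedged toy example, not the systems evaluated in the papers: the spatial stage is a simple delay-and-sum beamformer (fixed unit weights), and the postfilter is a Wiener-type spectral gain computed from assumed speech and noise power spectral densities (PSDs).

```python
import numpy as np

# Toy sketch of the classic two-stage pipeline: linear spatial filter
# followed by a single-channel postfilter. All signals and PSDs below
# are illustrative assumptions.
rng = np.random.default_rng(1)

M, F = 4, 257                       # microphones, frequency bins
# One noisy STFT frame (complex, shape M x F)
X = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))

# Stage 1: linear spatial filtering (delay-and-sum, unit weights)
w = np.ones(M) / M
Y = w.conj() @ X                    # beamformer output, shape (F,)

# Stage 2: single-channel postfilter with assumed PSD estimates
phi_ss = np.full(F, 1.0)            # assumed speech PSD
phi_nn = np.full(F, 0.5)            # assumed residual-noise PSD
G = phi_ss / (phi_ss + phi_nn)      # Wiener gain, values in [0, 1]
S_hat = G * Y                       # enhanced single-channel output

print(S_hat.shape)  # (257,)
```

Note how the spectral gain `G` is applied only after the multichannel signal has been collapsed to a single channel; a joint nonlinear filter would instead map all `M` channels to the enhanced output directly.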
Analysis of the performance benefit of nonlinear spatial filters
While it is convenient to develop the spatial and spectral processing stages independently, as shown on the left-hand side of Figure 1, theoretical analyses consolidated in our paper [1] show that this is optimal in the minimum mean-squared error (MMSE) sense only under the restrictive assumption that the noise signal follows a Gaussian distribution. In our work [1-3], we demonstrate through evaluations that a joint spatial and spectral nonlinear filter, as illustrated on the right-hand side of Figure 1, indeed enables considerable performance improvements over the composed setup in non-Gaussian noise scenarios, revealing the theoretical results to be highly relevant for practical applications.
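The role of the Gaussian assumption can be illustrated with a scalar Monte-Carlo toy experiment (my own simplification, not the multichannel setup of the papers): for a Gaussian source in Gaussian noise, the MMSE estimator E[s|x] is linear in the observation, but for heavy-tailed (here Laplacian) noise the conditional mean visibly bends away from the best linear estimate.

```python
import numpy as np

# Toy illustration: MMSE estimation of a Gaussian source in Laplacian
# (heavy-tailed, non-Gaussian) noise. The conditional mean E[s|x] is
# estimated empirically by binning and compared to the best *linear*
# MMSE estimate, which would be exact only for Gaussian noise.
rng = np.random.default_rng(42)
N = 1_000_000

s = rng.standard_normal(N)                      # Gaussian source, variance 1
n = rng.laplace(scale=1 / np.sqrt(2), size=N)   # Laplacian noise, variance 1
x = s + n                                       # noisy observation

# Best linear MMSE estimate: sigma_s^2 / (sigma_s^2 + sigma_n^2) * x = 0.5 x
s_lin = 0.5 * x

# Empirical conditional mean E[s | x] via histogram binning
bins = np.linspace(-4, 4, 41)
idx = np.digitize(x, bins)
cond_mean = np.array([s[idx == k].mean() for k in range(1, len(bins))])
centers = 0.5 * (bins[:-1] + bins[1:])

# For large |x|, the heavy noise tails make outliers likely to stem from
# the noise, so E[s|x] falls clearly below the linear estimate 0.5 x.
print(cond_mean[-1], 0.5 * centers[-1])
```

In this toy model the optimal estimator saturates for large observations instead of growing linearly, which is exactly the kind of behavior a purely linear filter cannot reproduce.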

Figure 2 depicts experimental results published in [1] for an inhomogeneous noise field created by five interfering noise sources. The noise and target signals are captured by a two-channel microphone array. As the number of interfering sources is larger than the number of microphones, the enhancement capabilities of traditional two-step approaches combining a linear spatial filter and a postfilter are limited. In the presented example, the interfering sources emit Gaussian bursts with only one interfering source being active per time segment. The uniform green coloration of the vertical stripes in the noisy spectrogram reflects the spectral stationarity. The vertical dark blue lines separate segments with different spatial properties. We would like to emphasize the visible and audible difference between the result obtained by an MVDR beamformer combined with a postfilter (middle of the bottom row) and the analytical joint spatial-spectral nonlinear filter (last in the bottom row). While the latter can recover the original clean signal almost perfectly, the MVDR plus postfilter suffers from audible speech degradation and residual noise. The audio examples can be found here. We conclude from this experiment that a nonlinear spatial filter that combines the spatial and spectral processing stages is capable of eliminating more than the traditional M-1 sources from M microphones without spatial adaptation and thus has a notably increased spatial selectivity.
In our work [1-3], we evaluated the performance of the nonlinear spatial filter in different non-Gaussian noise scenarios. We also find notable performance improvements in further experiments with heavy-tailed noise distributions, interfering human speakers, and real-world noise recordings from the CHiME3 database.
Do not miss the exciting audio examples for these experiments here.
Future work
Publications
- Kristina Tesch, Timo Gerkmann, "Nonlinear Spatial Filtering in Multichannel Speech Enhancement", IEEE/ACM Trans. Audio, Speech, Language Proc., 2021. [doi] [arxiv] [audio] ITG VDE award 2022
- Kristina Tesch, Timo Gerkmann, "Nonlinear Spatial Filtering For Multichannel Speech Enhancement in Inhomogeneous Noise Fields", IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Barcelona, Spain, May 2020. [doi]
- Kristina Tesch, Robert Rehr, Timo Gerkmann, "On Nonlinear Spatial Filtering in Multichannel Speech Enhancement", ISCA Interspeech, Graz, Austria, Sep. 2019. [doi]
- Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann, "On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement", ISCA Interspeech, Incheon, Korea, Sep. 2022. [arxiv]