Theses
If you wish to write a thesis focusing on signal processing, you are at the right place to find a suitable topic. Here you can find proposed topics for bachelor and master theses. If you have your own ideas or suggestions, you may also propose your own topic; in this case, please do not hesitate to contact us so that we can agree on an individual topic.
Proposal 1: Audio-Visual Signal Processing
Speech processing algorithms are an integral part of many technical devices that are ubiquitous in the lives of many people, like smartphones (e.g. speech recognition and telephony) or hearing aids.
Most of these algorithms rely on audio data alone, i.e. microphone recordings. In very noisy situations, e.g. in a crowded restaurant or on a busy street, however, the desired speech signal at the microphone can be severely corrupted by the undesired noise and the performance of speech processing algorithms may drop significantly. In such acoustically challenging situations, humans are known to also utilize visual cues, most prominently lip reading, to still be able to comprehend what has been said. So far, this additional source of information is neglected by the vast majority of mainstream approaches.
In this thesis, we will explore ways to improve speech processing by also utilizing visual information, especially when the audio signal is severely distorted. The major question in this context is how the information obtained from the two modalities can be combined. While there are many sophisticated lip reading algorithms as well as audio-only speech processing algorithms out there, there are only few methods that use information from one modality to benefit the processing of the other. In this thesis, you will first implement and thoroughly evaluate one or more reference approaches, identifying their strengths and weaknesses. Based on this analysis, we will strive for novel ways to improve the performance of these approaches.
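For illustration, here is a minimal sketch (not a reference design) of one possible late-fusion strategy, assuming precomputed per-frame audio spectra and lip-region embeddings that are already time-aligned; all module names and dimensions are made up for this example.

```python
import torch
import torch.nn as nn

class LateFusionEnhancer(nn.Module):
    """Toy late-fusion model: combines audio and lip-video embeddings to
    predict a time-frequency mask for the noisy magnitude spectrogram.
    Layer sizes are placeholders, not a proposed architecture."""

    def __init__(self, audio_dim=257, video_dim=128, hidden=256):
        super().__init__()
        self.audio_net = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_net = nn.GRU(video_dim, hidden, batch_first=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, audio_dim), nn.Sigmoid()
        )

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, frames, freq bins), lip_feats: (batch, frames, video_dim)
        a, _ = self.audio_net(noisy_mag)
        v, _ = self.video_net(lip_feats)
        mask = self.mask_head(torch.cat([a, v], dim=-1))
        return mask * noisy_mag  # enhanced magnitude estimate
```

Whether such a simple concatenation is the best way to combine the modalities is exactly one of the open questions of the thesis.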
Basic knowledge of signal processing as well as programming skills, preferably Python or Matlab, are a definite plus.
If you’re interested or want more background information please don’t hesitate to get in touch with Julius Richter (julius.richter@uni-hamburg.de).
Contact: Julius Richter, Prof. Timo Gerkmann
Proposal 2: Phase-Aware Speech Enhancement
Speech understanding in noise degrades greatly with decreasing signal-to-noise ratio. This effect is even more severe for hearing-impaired people, hearing aid users, and cochlear implant users. Noise reduction algorithms aim at reducing the noise to facilitate speech communication. This is particularly difficult when the noise signal is highly time-varying.
Many speech enhancement algorithms are based on a representation of the noisy speech signal in the short-time Fourier transform (STFT) domain. In this domain, speech is represented by the STFT magnitude and the STFT phase. In the last two decades, research on speech enhancement algorithms mainly focused on improving the STFT magnitude, while the STFT phase was left unchanged. More recently, however, researchers pointed out that phase processing may improve speech enhancement algorithms further.
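To make this representation concrete, the following sketch (with made-up settings and a placeholder gain) separates a signal into STFT magnitude and phase and resynthesizes it from a modified magnitude combined with the unmodified noisy phase, which is what classical magnitude-only enhancement does.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)  # stand-in for 1 s of noisy speech

# Analysis: complex STFT, split into magnitude and phase
f, t, X = stft(noisy, fs=fs, nperseg=512, noverlap=384)
magnitude = np.abs(X)
phase = np.angle(X)

# Magnitude-only enhancement: apply some real-valued gain to the magnitude
# (identity here as a placeholder), keep the noisy phase untouched
gain = np.ones_like(magnitude)
enhanced_mag = gain * magnitude

# Synthesis: recombine the enhanced magnitude with the *noisy* phase
X_enhanced = enhanced_mag * np.exp(1j * phase)
_, enhanced = istft(X_enhanced, fs=fs, nperseg=512, noverlap=384)
```

Phase-aware enhancement replaces the last step: instead of reusing the noisy phase, an improved phase estimate is used.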
This thesis aims at implementing, analyzing and developing algorithms for phase processing. The overall goal is to obtain intelligible, high quality speech signals even from heavily distorted recordings in real-time. Of special interest in this context are the robust estimation of the clean speech phase from the noise-corrupted recording, the interplay between the traditional enhancement of spectral amplitudes and the recent developments in spectral phase enhancement, as well as the potential of modern machine learning techniques like deep learning.
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Starting from this, new concepts to push the limits of the current state of the art will be developed and realized. Finally, the derived algorithm(s) will be evaluated and compared to existing approaches by means of instrumental measures and listening experiments. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Experience and basic knowledge of signal processing are definitely helpful but not mandatory.
Contact: Tal Peer, Prof. Timo Gerkmann (UHH)
Proposal 3: Reduction of Ego-Noise for Automatic Speech Recognition on a Robot
Humans commonly use speech to communicate and interact with each other. A natural way to allow a human to interact with a robot is to use automatic speech recognition. For this, the voice commands need to be decoded from raw audio data which is recorded through the microphones attached to the robot. As robots can be employed in many different environments, the microphones often do not only capture the desired speech signal but also background noises and reverberation, i.e., reflections of the speech signal from the walls of a room. In addition to the environmental background noises, the microphones also capture the noise from the fans and the actuators during movement. The latter type of noise is emitted by the hardware of the robot and is often referred to as ego noise. These additional noises degrade the performance of speech recognizers significantly, making speech recognition a challenging task on a robot platform.
The focus of this master thesis is the enhancement of speech signals that are corrupted by ego noise to improve the robustness of the robot's speech recognition system. The goal is to implement and to compare different speech enhancement techniques in terms of their capability to remove the robot's ego noise.
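As one illustrative baseline, and not necessarily a method to be used in the thesis, the sketch below applies simple magnitude-domain spectral subtraction using an ego-noise spectrum estimated from a noise-only recording; the signal names and parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, noise_only, fs=16000, nperseg=512, floor=0.05):
    """Subtract an average ego-noise power spectrum from the noisy STFT.
    `noise_only` is a recording of the robot's fans/actuators without speech."""
    f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise_only, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)  # average noise power per bin
    clean_psd = np.maximum(np.abs(X) ** 2 - noise_psd, floor * np.abs(X) ** 2)
    enhanced = np.sqrt(clean_psd) * np.exp(1j * np.angle(X))    # keep the noisy phase
    _, x_hat = istft(enhanced, fs=fs, nperseg=nperseg)
    return x_hat
```

Ego noise is partly predictable from the robot's own motor commands, which is why incorporating additional sensor data (see the goals below) can go beyond such purely audio-based baselines.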
Goals
- Acquire noise data from the robot.
- Implement various speech enhancement algorithms from the literature.
- Compare the quality of the enhanced signals using instrumental measures and the robot's automatic speech recognition system.
- Develop or extend existing enhancement approaches to include sensor data from the robot to improve the enhancement.
Useful Skills
- Background in speech processing and digital signal processing
- Background in machine learning
- Programming experience in Matlab or Python
Contact
- Huajian Fang, Prof. Dr.-Ing. Timo Gerkmann
Proposal 4: Multichannel Speech Enhancement
Humans have impressive capabilities to filter out background noise when focusing on a specific target speaker. This is largely due to the fact that humans have two ears that allow for spatial processing of the received sound. Many classical multichannel speech enhancement algorithms are also based on this approach. So-called beamformers (e.g. the delay-and-sum beamformer or the minimum variance distortionless response beamformer) enhance a speech signal by emphasizing the signal from the desired direction and attenuating signals from other directions.
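To illustrate the classical linear processing model, a minimal frequency-domain delay-and-sum beamformer for a uniform linear array is sketched below; the array geometry, steering direction, and STFT settings are example values, not a proposed setup.

```python
import numpy as np
from scipy.signal import stft, istft

def delay_and_sum(mics, fs=16000, mic_dist=0.05, doa_deg=30.0, c=343.0, nperseg=512):
    """mics: (num_mics, num_samples) array of microphone signals.
    Phase-aligns each channel towards the desired direction of arrival and averages."""
    num_mics = mics.shape[0]
    f, t, X = stft(mics, fs=fs, nperseg=nperseg)   # shape: (num_mics, freq bins, frames)
    # Per-microphone delays for a far-field source at angle doa_deg (uniform linear array)
    delays = np.arange(num_mics) * mic_dist * np.sin(np.deg2rad(doa_deg)) / c
    # Steering vector: phase shifts that undo those delays at every frequency
    steering = np.exp(1j * 2 * np.pi * f[None, :] * delays[:, None])  # (num_mics, freq bins)
    Y = np.mean(steering[:, :, None] * X, axis=0)  # align and average the channels
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```

Note that this is a fixed, linear spatial filter; the thesis asks whether DNN-based spatial filtering can do better than such linear models.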
In the field of single-channel speech enhancement, the use of machine learning techniques and in particular deep neural networks (DNNs) led to significant improvements. These networks can be trained to learn arbitrarily complex nonlinear functions. As such, it seems possible to improve the multichannel speech enhancement performance by using DNNs to learn spatial filters that are not restricted to a linear processing model as beamformers are.
In this thesis, we will investigate the potential of DNNs for multichannel speech enhancement. As a first step, existing DNN-based multichannel speech enhancement algorithms will be implemented and evaluated. Of particular importance is the question under which circumstances the ML-based approaches outperform existing methods. The gained insights and experience can then be used to improve the existing DNN-based algorithms.
Basic knowledge of signal processing and machine learning as well as programming experience in any language (preferably Python) are required for this thesis. Experience with a machine learning toolbox (e.g. PyTorch or TensorFlow) is helpful but not mandatory.
Contact: Alina Mannanova, Prof. Timo Gerkmann
Proposal 5: Speech Dereverberation for Hearing Aids
Overview
Reverberation is an acoustic phenomenon that occurs when a sound signal is reflected by walls and obstacles in a specular or diffuse way. The resulting reverberated sound can be modeled as the output of a convolutive filter whose impulse response depends on the geometry of the scene and the materials of the different reflectors. Although reverberation is a linear process, it is difficult to extract the original anechoic sound from its reverberated version in a blind scenario, that is, without knowing the so-called room impulse response. Reverberation is perceived as natural in most scenarios by unimpaired listeners; however, strong reverberation dramatically degrades speech processing performance for hearing-impaired listeners (speaker identification and separation, speech enhancement, etc.).
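To illustrate this convolutive model, reverberant speech can be simulated by convolving an anechoic signal with a room impulse response (RIR); in the sketch below both the signal and the RIR are synthetic stand-ins.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
anechoic = np.random.randn(2 * fs)          # stand-in for 2 s of anechoic speech

# Synthetic RIR: exponentially decaying noise (roughly 0.5 s of reverberation)
t = np.arange(int(0.5 * fs)) / fs
rir = np.random.randn(t.size) * np.exp(-t / 0.1)
rir /= np.max(np.abs(rir))

# Reverberant signal = anechoic signal convolved with the RIR.
# Blind dereverberation tries to recover `anechoic` without knowing `rir`.
reverberant = fftconvolve(anechoic, rir)[: anechoic.size]
```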
Objectives
The objective of this PhD thesis is to design AI-based dereverberation algorithms for the hearing industry: in particular, these algorithms should be real-time capable and robust to conditions encountered in real-life situations such as heavy reverberation, non-stationary noise, multiple sources, and other interferences. They should also be lightweight enough to run on embedded devices such as cochlear implants or hearing aids.
Skills
Required skills
- Basic knowledge of signal processing
- Mathematical background in probability theory and machine learning
- Programming experience (any language, preferably Python or Matlab)
Useful skills
- Theoretical background in room acoustics
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
Contact
Proposal 6: Audio-Visual Emotion Recognition (Social Signal Processing)
Overview
Emerging technologies such as virtual assistants (e.g. Siri and Alexa) and collaboration tools (e.g. Zoom and Discord) have enriched a large part of our lives. Traditionally, these technologies have mainly focused on understanding user commands and supporting their intended tasks. However, more recently, researchers in the fields of psychology and machine learning have been pushing towards the development of socially intelligent systems capable of understanding the emotional states of users as expressed by non-verbal cues. This interdisciplinary research field combining psychology and machine learning is termed Social Signal Processing [1].
Emotions are complex states of feeling, positive or negative, that result in physical and psychological changes which influence our behavior. Such emotional states are communicated by an individual through both verbal and non-verbal behavior. Emotional cues abound across modalities: audio prosody, speech semantic content, facial reactions, and body language. Empirical findings in the literature reveal that different dimensions of emotion are best expressed by certain modalities. For example, the arousal dimension (the activation-deactivation dimension of emotion) is recognized better from the audio modality than from the video modality, whereas the valence dimension (the pleasure-displeasure dimension of emotion) is best recognized from the video modality.
A review of emotion recognition techniques can be found in [2]. Recent work from our lab on speech emotion recognition can be found in [3].
Objectives
Recent literature shows progress towards multimodal approaches for emotion recognition, which jointly and effectively model different dimensions of emotion. However, several challenges remain in a multimodal approach, such as:
- usage of appropriate modality fusion strategies,
- automatic alignment of different modalities,
- machine learning architectures for the respective modalities, and
- disentanglement or joint-training strategies for respective emotion dimensions.
In this project, we will explore multimodal approaches for emotion recognition, by aptly fusing different modalities and jointly modeling different dimensions of emotion.
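As a rough illustration of such a fusion model, the sketch below combines precomputed audio and video embeddings and predicts arousal and valence with separate heads; all layer sizes and names are assumptions made for this example, not a proposed architecture.

```python
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    """Toy fusion model: summarizes audio and video embedding sequences and
    predicts arousal and valence jointly. Sizes are placeholders only."""

    def __init__(self, audio_dim=40, video_dim=512, hidden=128):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.arousal_head = nn.Linear(2 * hidden, 1)
        self.valence_head = nn.Linear(2 * hidden, 1)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, audio frames, audio_dim), video_feats: (batch, video frames, video_dim)
        _, ha = self.audio_enc(audio_feats)   # final hidden state summarizes the audio
        _, hv = self.video_enc(video_feats)   # final hidden state summarizes the video
        fused = torch.cat([ha[-1], hv[-1]], dim=-1)
        return self.arousal_head(fused), self.valence_head(fused)
```

How to fuse the modalities, align them in time, and share or separate the two prediction heads are exactly the open design questions listed above.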
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Building on this, novel concepts to push the limits of the current state of the art will be developed and realized. Finally, the developed algorithm(s) will be evaluated and compared to existing approaches by means of quantitative and qualitative analysis of the results. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Skills
- Programming experience (any language, preferably Python or Matlab)
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
- Experience and basic knowledge of signal processing are definitely helpful but not mandatory
References
[1] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image and Vision Computing, vol. 27, no. 12, pp. 1743–1759, 2009.
[2] B. W. Schuller, "Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, Apr. 2018.
[3] N. Raj Prabhu, G. Carbajal, N. Lehmann-Willenbrock, and T. Gerkmann, "End-to-end label uncertainty modeling for speech emotion recognition using Bayesian neural networks," arXiv preprint arXiv:2110.03299, 2021, https://arxiv.org/abs/2110.03299.
Contact
Proposal 7: Sequence Modeling for Speech and Language Processing
Speech processing tasks have benefited greatly from various sequence modeling techniques, from recurrent neural networks (LSTMs, GRUs) and temporal convolutional networks (TCNs) to attention-based models (Transformers). Each technique has its unique characteristics as well as computational advantages and disadvantages. Speech signals contain complex structures at different levels of abstraction, from acoustic to linguistic and semantic information, and from local to global information. One (or more) of these aspects might be more relevant for each speech-related task, and might be modeled better by specific methods.
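As a simple starting point, the sketch below shows how the same sequence of spectral frames could be passed through either a recurrent encoder or an attention-based encoder in PyTorch; the feature and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

frames = torch.randn(8, 200, 80)   # (batch, time frames, feature dim), e.g. log-mel features

# Recurrent encoder: processes the frames sequentially, left to right
lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)
lstm_out, _ = lstm(frames)         # (8, 200, 256)

# Attention-based encoder: every frame can attend to every other frame
encoder_layer = nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
transformer_out = transformer(frames)   # (8, 200, 80)
```

Which of these inductive biases (sequential recurrence, local convolution, global attention) best matches a given task is one of the questions the thesis will investigate.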
In this thesis, we would like to investigate how different architectures handle the various types of information contained in speech signals and potentially use our findings to better and more efficiently tackle various speech-related tasks, especially speech enhancement, separation and emotion recognition. In this context, large pre-trained speech models such as HuBERT and WavLM can also be explored.
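Such pretrained models can typically be used as frozen feature extractors. The sketch below assumes the Hugging Face transformers package and the microsoft/wavlm-base checkpoint are available; this is an assumption for illustration, not a requirement of the thesis.

```python
import torch
from transformers import WavLMModel  # assumes the Hugging Face `transformers` package

# Load a pretrained speech representation model (checkpoint name is an example;
# any compatible WavLM or HuBERT checkpoint could be used instead)
model = WavLMModel.from_pretrained("microsoft/wavlm-base")
model.eval()

waveform = torch.randn(1, 16000)  # stand-in for 1 s of 16 kHz audio
with torch.no_grad():
    features = model(waveform).last_hidden_state  # (batch, frames, hidden dim)
print(features.shape)
```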
Basic knowledge of machine learning and experience with Python programming are required. Knowledge of signal processing and/or experience with Python deep learning libraries is a plus.
Contact: Danilo de Oliveira, Prof. Timo Gerkmann