Theses
If you wish to write a thesis focusing on signal processing, you are at the right place to find a suitable topic. Here you can find proposed topics for bachelor and master theses. If you have your own ideas or suggestions, you may also propose your own topic; in this case, please do not hesitate to contact us so that we can agree on an individual topic.
Proposal 1: Audio-Visual Signal Processing
Speech processing algorithms are an integral part of many technical devices that are ubiquitous in the lives of many people, like smartphones (e.g. speech recognition and telephony) or hearing aids.
Most of these algorithms rely on audio data alone, i.e. microphone recordings. In very noisy situations, e.g. in a crowded restaurant or on a busy street, however, the desired speech signal at the microphone can be severely corrupted by the undesired noise and the performance of speech processing algorithms may drop significantly. In such acoustically challenging situations, humans are known to also utilize visual cues, most prominently lip reading, to still be able to comprehend what has been said. So far, this additional source of information is neglected by the vast majority of mainstream approaches.
In this thesis, we will explore ways to improve speech processing by also utilizing visual information, especially when the audio signal is severely distorted. The major question in this context is how the information obtained from the two modalities can be combined. While there are many sophisticated lip reading algorithms as well as audio-only speech processing algorithms out there, there are only few methods that use information from one modality to benefit the processing of the other. In this thesis, you will first implement and thoroughly evaluate one or more reference approaches, identifying their strengths and weaknesses. Based on this analysis, we will strive for novel ways to improve the performance of these approaches.
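For illustration, here is a minimal sketch (not a reference design) of one possible late-fusion strategy, assuming precomputed per-frame audio spectra and lip-region embeddings that are already time-aligned; all module names and dimensions are made up for this example.

```python
import torch
import torch.nn as nn

class LateFusionEnhancer(nn.Module):
    """Toy late-fusion model: combines audio and lip-video embeddings to
    predict a time-frequency mask for the noisy magnitude spectrogram.
    Layer sizes are placeholders, not a proposed architecture."""

    def __init__(self, audio_dim=257, video_dim=128, hidden=256):
        super().__init__()
        self.audio_net = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_net = nn.GRU(video_dim, hidden, batch_first=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, audio_dim), nn.Sigmoid()
        )

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, frames, freq bins), lip_feats: (batch, frames, video_dim)
        a, _ = self.audio_net(noisy_mag)
        v, _ = self.video_net(lip_feats)
        mask = self.mask_head(torch.cat([a, v], dim=-1))
        return mask * noisy_mag  # enhanced magnitude estimate
```

Whether such a simple concatenation is the best way to combine the modalities is exactly one of the open questions of the thesis.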
Basic knowledge of signal processing as well as programming skills, preferably Python or Matlab, are a definite plus.
If you’re interested or want more background information please don’t hesitate to get in touch with Julius Richter (julius.richter@uni-hamburg.de).
Contact: Julius Richter, Prof. Timo Gerkmann
Proposal 2: Phase-Aware Speech Enhancement
Speech understanding in noise degrades greatly with decreasing signal-to-noise ratio. This effect is even more severe for hearing-impaired people, hearing aid users, and cochlear implant users. Noise reduction algorithms aim at reducing the noise to facilitate speech communication. This is particularly difficult when the noise signal is highly time-varying.
Many speech enhancement algorithms are based on a representation of the noisy speech signal in the short-time Fourier transform (STFT) domain. In this domain, speech is represented by the STFT magnitude and the STFT phase. In the last two decades, research on speech enhancement algorithms mainly focused on improving the STFT magnitude, while the STFT phase was left unchanged. More recently, however, researchers pointed out that phase processing may improve speech enhancement algorithms further.
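To make this representation concrete, the following sketch (with made-up settings and a placeholder gain) separates a signal into STFT magnitude and phase and resynthesizes it from a modified magnitude combined with the unmodified noisy phase, which is what classical magnitude-only enhancement does.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)  # stand-in for 1 s of noisy speech

# Analysis: complex STFT, split into magnitude and phase
f, t, X = stft(noisy, fs=fs, nperseg=512, noverlap=384)
magnitude = np.abs(X)
phase = np.angle(X)

# Magnitude-only enhancement: apply some real-valued gain to the magnitude
# (identity here as a placeholder), keep the noisy phase untouched
gain = np.ones_like(magnitude)
enhanced_mag = gain * magnitude

# Synthesis: recombine the enhanced magnitude with the *noisy* phase
X_enhanced = enhanced_mag * np.exp(1j * phase)
_, enhanced = istft(X_enhanced, fs=fs, nperseg=512, noverlap=384)
```

Phase-aware enhancement replaces the last step: instead of reusing the noisy phase, an improved phase estimate is used.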
This thesis aims at implementing, analyzing and developing algorithms for phase processing. The overall goal is to obtain intelligible, high quality speech signals even from heavily distorted recordings in real-time. Of special interest in this context are the robust estimation of the clean speech phase from the noise-corrupted recording, the interplay between the traditional enhancement of spectral amplitudes and the recent developments in spectral phase enhancement, as well as the potential of modern machine learning techniques like deep learning.
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Starting from this, new concepts to push the limits of the current state of the art will be developed and realized. Finally, the derived algorithm(s) will be evaluated and compared to existing approaches by means of instrumental measures and listening experiments. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Experience and basic knowledge of signal processing are definitely helpful but not mandatory.
Contact: Tal Peer, Prof. Timo Gerkmann (UHH)
Proposal 3: Reduction of Ego-Noise for Automatic Speech Recognition on a Robot
Humans commonly use speech to communicate and interact with each other. A natural way to allow a human to interact with a robot is to use automatic speech recognition. For this, the voice commands need to be decoded from raw audio data which is recorded through the microphones attached to the robot. As robots can be employed in many different environments, the microphones often do not only capture the desired speech signal but also background noises and reverberation, i.e., reflections of the speech signal from the walls of a room. In addition to the environmental background noises, the microphones also capture the noise from the fans and the actuators during movement. The latter type of noise is emitted by the hardware of the robot and is often referred to as ego noise. These additional noises degrade the performance of speech recognizers significantly, making speech recognition a challenging task on a robot platform.
The focus of this master thesis is the enhancement of speech signals that are corrupted by ego noise to improve the robustness of the robot's speech recognition system. The goal is to implement and to compare different speech enhancement techniques in terms of their capability to remove the robot's ego noise.
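As one illustrative baseline, and not necessarily a method to be used in the thesis, the sketch below applies simple magnitude-domain spectral subtraction using an ego-noise spectrum estimated from a noise-only recording; the signal names and parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, noise_only, fs=16000, nperseg=512, floor=0.05):
    """Subtract an average ego-noise power spectrum from the noisy STFT.
    `noise_only` is a recording of the robot's fans/actuators without speech."""
    f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise_only, fs=fs, nperseg=nperseg)
    noise_psd = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)  # average noise power per bin
    clean_psd = np.maximum(np.abs(X) ** 2 - noise_psd, floor * np.abs(X) ** 2)
    enhanced = np.sqrt(clean_psd) * np.exp(1j * np.angle(X))    # keep the noisy phase
    _, x_hat = istft(enhanced, fs=fs, nperseg=nperseg)
    return x_hat
```

Ego noise is partly predictable from the robot's own motor commands, which is why incorporating additional sensor data (see the goals below) can go beyond such purely audio-based baselines.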
Goals
- Acquire noise data from the robot.
- Implement various speech enhancement algorithms from the literature.
- Compare the quality of the enhanced signals using instrumental measures and the robot's automatic speech recognition system.
- Develop or extend existing enhancement approaches to include sensor data from the robot to improve the enhancement.
Useful Skills
- Background in speech processing and digital signal processing
- Background in machine learning
- Programming experience in Matlab or Python
Contact
- Huajian Fang, Prof. Dr.-Ing. Timo Gerkmann
Proposal 4: Multichannel Speech Enhancement
Humans have impressive capabilities to filter out background noise when focusing on a specific target speaker. This is largely due to the fact that humans have two ears that allow for spatial processing of the received sound. Many classical multichannel speech enhancement algorithms are also based on this approach. So-called beamformers (e.g. the delay-and-sum beamformer or the minimum variance distortionless response beamformer) enhance a speech signal by emphasizing the signal from the desired direction and attenuating signals from other directions.
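To illustrate the classical linear processing model, a minimal frequency-domain delay-and-sum beamformer for a uniform linear array is sketched below; the array geometry, steering direction, and STFT settings are example values, not a proposed setup.

```python
import numpy as np
from scipy.signal import stft, istft

def delay_and_sum(mics, fs=16000, mic_dist=0.05, doa_deg=30.0, c=343.0, nperseg=512):
    """mics: (num_mics, num_samples) array of microphone signals.
    Phase-aligns each channel towards the desired direction of arrival and averages."""
    num_mics = mics.shape[0]
    f, t, X = stft(mics, fs=fs, nperseg=nperseg)   # shape: (num_mics, freq bins, frames)
    # Per-microphone delays for a far-field source at angle doa_deg (uniform linear array)
    delays = np.arange(num_mics) * mic_dist * np.sin(np.deg2rad(doa_deg)) / c
    # Steering vector: phase shifts that undo those delays at every frequency
    steering = np.exp(1j * 2 * np.pi * f[None, :] * delays[:, None])  # (num_mics, freq bins)
    Y = np.mean(steering[:, :, None] * X, axis=0)  # align and average the channels
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```

Note that this is a fixed, linear spatial filter; the thesis asks whether DNN-based spatial filtering can do better than such linear models.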
In the field of single-channel speech enhancement, the use of machine learning techniques and in particular deep neural networks (DNNs) led to significant improvements. These networks can be trained to learn arbitrarily complex nonlinear functions. As such, it seems possible to improve the multichannel speech enhancement performance by using DNNs to learn spatial filters that are not restricted to a linear processing model as beamformers are.
In this thesis, we will investigate the potential of DNNs for multichannel speech enhancement. As a first step, existing DNN-based multichannel speech enhancement algorithms will be implemented and evaluated. Of particular importance is the question under which circumstances the ML-based approaches outperform existing methods. The gained insights and experience can then be used to improve the existing DNN-based algorithms.
Basic knowledge of signal processing and machine learning as well as programming experience in any language (preferably Python) are required for this thesis. Experience with a machine learning toolbox (e.g. PyTorch or TensorFlow) is helpful but not mandatory.
Contact: Alina Mannanova, Prof. Timo Gerkmann
Proposal 5: Speech Dereverberation for Hearing Aids
Overview
Reverberation is an acoustic phenomenon that occurs when a sound signal is reflected by walls and obstacles in a specular or diffuse way. The resulting reverberated sound can be modeled as the output of a convolutive filter whose impulse response depends on the geometry of the scene and the materials of the different reflectors. Although reverberation is a linear process, it is difficult to extract the original anechoic sound from its reverberated version in a blind scenario, that is, without knowing the so-called room impulse response. Reverberation is perceived as natural in most scenarios by unimpaired listeners; however, strong reverberation dramatically degrades speech processing performance for hearing-impaired listeners (speaker identification and separation, speech enhancement, etc.).
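To illustrate this convolutive model, reverberant speech can be simulated by convolving an anechoic signal with a room impulse response (RIR); in the sketch below both the signal and the RIR are synthetic stand-ins.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
anechoic = np.random.randn(2 * fs)          # stand-in for 2 s of anechoic speech

# Synthetic RIR: exponentially decaying noise (roughly 0.5 s of reverberation)
t = np.arange(int(0.5 * fs)) / fs
rir = np.random.randn(t.size) * np.exp(-t / 0.1)
rir /= np.max(np.abs(rir))

# Reverberant signal = anechoic signal convolved with the RIR.
# Blind dereverberation tries to recover `anechoic` without knowing `rir`.
reverberant = fftconvolve(anechoic, rir)[: anechoic.size]
```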
Objectives
The objective of this PhD thesis is to design AI-based dereverberation algorithms for the hearing industry: in particular, these algorithms should be real-time capable and robust to conditions encountered in real-life situations such as heavy reverberation, non-stationary noise, multiple sources, and other interferences. They should also be lightweight enough to run on embedded devices such as cochlear implants or hearing aids.
Skills
Required skills
- Basic knowledge of signal processing
- Mathematical background in probability theory and machine learning
- Programming experience (any language, preferably Python or Matlab)
Useful skills
- Theoretical background in room acoustics
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
Contact
Proposal 6: Audio-Visual Emotion Recognition (Social Signal Processing)
Overview
Emerging technologies such as virtual assistants (e.g. Siri and Alexa) and collaboration tools (e.g. Zoom and Discord) have enriched a large part of our lives. Traditionally, these technologies have mainly focused on understanding user commands and supporting their intended tasks. However, more recently, researchers in the fields of psychology and machine learning have been pushing towards the development of socially intelligent systems capable of understanding the emotional states of users as expressed by non-verbal cues. This interdisciplinary research field combining psychology and machine learning is termed Social Signal Processing [1].
Emotions are complex states of feeling, positive or negative, that result in physical and psychological changes which influence our behavior. Such emotional states are communicated by an individual through both verbal and non-verbal behavior. Emotional cues abound across modalities: audio prosody, speech semantic content, facial reactions, and body language. Empirical findings in the literature reveal that different dimensions of emotion are best expressed by certain modalities. For example, the arousal dimension (the activation-deactivation dimension of emotion) is recognized better from the audio modality than from the video modality, whereas the valence dimension (the pleasure-displeasure dimension of emotion) is best recognized from the video modality.
A review of emotion recognition techniques can be found in [2]. Recent work from our lab on speech emotion recognition can be found in [3].
Objectives
Recent literature shows progress towards multimodal approaches for emotion recognition, which jointly and effectively model different dimensions of emotion. However, several challenges remain in a multimodal approach, such as:
- usage of appropriate modality fusion strategies,
- automatic alignment of different modalities,
- machine learning architectures for the respective modalities, and
- disentanglement or joint-training strategies for respective emotion dimensions.
In this project, we will explore multimodal approaches for emotion recognition, by aptly fusing different modalities and jointly modeling different dimensions of emotion.
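As a rough illustration of such a fusion model, the sketch below combines precomputed audio and video embeddings and predicts arousal and valence with separate heads; all layer sizes and names are assumptions made for this example, not a proposed architecture.

```python
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    """Toy fusion model: summarizes audio and video embedding sequences and
    predicts arousal and valence jointly. Sizes are placeholders only."""

    def __init__(self, audio_dim=40, video_dim=512, hidden=128):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.arousal_head = nn.Linear(2 * hidden, 1)
        self.valence_head = nn.Linear(2 * hidden, 1)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, audio frames, audio_dim), video_feats: (batch, video frames, video_dim)
        _, ha = self.audio_enc(audio_feats)   # final hidden state summarizes the audio
        _, hv = self.video_enc(video_feats)   # final hidden state summarizes the video
        fused = torch.cat([ha[-1], hv[-1]], dim=-1)
        return self.arousal_head(fused), self.valence_head(fused)
```

How to fuse the modalities, align them in time, and share or separate the two prediction heads are exactly the open design questions listed above.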
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Building on this, novel concepts to push the limits of the current state of the art will be developed and realized. Finally, the developed algorithm(s) will be evaluated and compared to existing approaches by means of quantitative and qualitative analysis of the results. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Skills
- Programming experience (any language, preferably Python or Matlab)
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
- Experience and basic knowledge of signal processing are definitely helpful but not mandatory
References
[1] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image and Vision Computing, vol. 27, no. 12, pp. 1743–1759, 2009.
[2] B. W. Schuller, "Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, Apr. 2018.
[3] N. Raj Prabhu, G. Carbajal, N. Lehmann-Willenbrock, and T. Gerkmann, "End-to-end label uncertainty modeling for speech emotion recognition using Bayesian neural networks," arXiv preprint arXiv:2110.03299, 2021, https://arxiv.org/abs/2110.03299.
Contact
Proposal 7: Sequence Modeling for Speech and Language Processing
Speech processing tasks have benefited greatly from various sequence modeling techniques, from recurrent neural networks (LSTMs, GRUs) and temporal convolutional networks (TCNs) to attention-based models (Transformers). Each technique has its unique characteristics as well as computational advantages and disadvantages. Speech signals contain complex structures at different levels of abstraction, from acoustic to linguistic and semantic information, and from local to global information. One (or more) of these aspects might be more relevant for each speech-related task, and might be modeled better by specific methods.
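As a simple starting point, the sketch below shows how the same sequence of spectral frames could be passed through either a recurrent encoder or an attention-based encoder in PyTorch; the feature and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

frames = torch.randn(8, 200, 80)   # (batch, time frames, feature dim), e.g. log-mel features

# Recurrent encoder: processes the frames sequentially, left to right
lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)
lstm_out, _ = lstm(frames)         # (8, 200, 256)

# Attention-based encoder: every frame can attend to every other frame
encoder_layer = nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
transformer_out = transformer(frames)   # (8, 200, 80)
```

Which of these inductive biases (sequential recurrence, local convolution, global attention) best matches a given task is one of the questions the thesis will investigate.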
In this thesis, we would like to investigate how different architectures handle the various types of information contained in speech signals and potentially use our findings to better and more efficiently tackle various speech-related tasks, especially speech enhancement, separation and emotion recognition. In this context, large pre-trained speech models such as HuBERT and WavLM can also be explored.
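Such pretrained models can typically be used as frozen feature extractors. The sketch below assumes the Hugging Face transformers package and the microsoft/wavlm-base checkpoint are available; this is an assumption for illustration, not a requirement of the thesis.

```python
import torch
from transformers import WavLMModel  # assumes the Hugging Face `transformers` package

# Load a pretrained speech representation model (checkpoint name is an example;
# any compatible WavLM or HuBERT checkpoint could be used instead)
model = WavLMModel.from_pretrained("microsoft/wavlm-base")
model.eval()

waveform = torch.randn(1, 16000)  # stand-in for 1 s of 16 kHz audio
with torch.no_grad():
    features = model(waveform).last_hidden_state  # (batch, frames, hidden dim)
print(features.shape)
```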
Basic knowledge of machine learning and experience with Python programming are required. Knowledge of signal processing and/or experience with Python deep learning libraries is a plus.
Contact: Danilo de Oliveira, Prof. Timo Gerkmann