Teaching
Courses
- LECTURES / VORLESUNGEN
Statistical Signal Processing | M.Sc. Winter Semester
Speech Signal Processing / Sprachsignalverarbeitung | M.Sc. Summer Semester
Digital Media Signal Processing / Digitale Mediensignalverarbeitung | B.Sc. Summer Semester
Mathematik der Informatik für Studierende des Lehramts | B.Sc. Summer Semester
- SEMINARS / SEMINARE
Data Science | B.Sc. Summer Semester
Aktuelle Themen der Audiosignalverarbeitung | B.Sc. Winter Semester
- LAB COURSES / PRAKTIKA
Praktikum Audiosignalverarbeitung | B.Sc. Winter Semester
- PROJECTS / PROJEKTE, for example
Deep Learning for Audio Processing + integriertes Seminar | B.Sc. Winter Semester
Deep Learning for Speech Emotion Recognition and Synthesis + integriertes Seminar | M.Sc. Winter Semester
B.Sc. and M.Sc. Theses
Here you can find which topics our students are currently working on, a list of completed theses, and open topics for new theses. If you are interested in one of these topics, please reach out to the listed contact person. If you have your own idea for a thesis topic in the field of signal processing, please contact us and we will see how we can develop your idea together.
Theses in Progress
B.Sc. Theses in Progress
- Jim-Frederic Gerth: Analysis of Meta-Learning for Speech Enhancement with Circular Microphone Arrays robust to Varying Diameter
- Leander Thiel: Extending Spatial Likelihood Coding with Complex Spectral Mapping for Joint Localization and Separation in Multi-Speaker Environments
M.Sc. Theses in Progress
- Alexander Baur: Decoupled Denoising Diffusion for Emotional Speech Conversion
Completed Bachelor Theses
- Michael Römer: Causal Score-based Diffusion Models for Real-Time Speech Enhancement, Jun 2025
- Marvin Mielchen: Bridging Particle Filtering and Machine Learning for Online DOA Tracking in Multichannel Speech Enhancement, Apr 2025
- Duru Zeynep Keçeci: Speech Emotion and Style Conversion Using Categorical Emotions with EARS Dataset, Mar 2025
- Akram Ahmed Ammar: Semantically Factorized Latent Space for Speech Enhancement, Jan 2025
- Rebecca Schnoor: Computation of Vocal Synchrony in Spontaneous Interaction using Deep Neural Networks, Jan 2025
- Deepesha Saurty: Improving Emotional Accuracy in Speech Synthesis through Arousal and Valence Manipulation, Jan 2025
- Kevin Luu: Informed speaker extraction using deep multichannel filtering, Jan 2025
- Leonie Junkher: Discrete vs Continuous Latent Diffusion Models for Speech Enhancement, Jan 2025
- Till Svajda: Vocal Dereverberation with Diffusion-Based Generative Models, Jan 2025
- Jonas Rochdi: Real-Time Speech Enhancement Using Mamba, Dec 2024
- Constantin Alexander Auga: Latent Diffusion Models for Emotion-Conditioned Speech Synthesis, Nov 2024
- Gerrit Schwitalla: Investigating the Effect of Filtering In-the-wild Datasets for Deep Learning-based Speech Enhancement, Nov 2024
- Marten Linge: Study of positional encoding approaches for speech enhancement, Nov 2024
- Helena Becker: "Speech Enhancement with Diffusion Models using Vision Transformers", August 2024
- Maris Hillemann: “Fractional Fourier Transform in Transformer-based Speech Enhancement”, Universität Hamburg, June 2024
- Sebastian Zaczek: “Recent Advances in Single-Channel Speech Separation of Same-Gender Speaker Mixtures”, Universität Hamburg, May 2024
- Tom Schimansky: “Music Source Separation using Score-Based Generative Models”, Universität Hamburg, May 2024
- Alieksieiev Oleksil: "Implementation and Analysis of a Real-time End-to-End Audio-visual Speech Enhancement Framework", Feb 2024
- Robin Breitgelder: "Lokalisierung von beweglichen Schallquellen mit tiefen neuronalen Netzen", Universität Hamburg, Oct. 2023
- Johannes Kolhoff: "Real-Time Phase Retrieval for Speech using Improved iterative algorithms", Universität Hamburg, June 2023
- Katirci Berkkan: "Robust preprocessing for real-time audio-visual speech separation", Universität Hamburg, April 2023
- David Tran: "Comparative Analysis of Multi-Stage Speech Enhancement Methods", Universität Hamburg, April 2023
- Torben Hellriegel: "Evaluation of Speaker Recognition Neural Networks", Universität Hamburg, Nov. 2022
- Nico Petereit: "Implementation eines Phase Vocoder zur Echtzeit-Anwendung im musikalischen Kontext", Universität Hamburg, Aug. 2022
- Chams Alassil Khoury: "Evaluation eines neuronalen Netzes für die mehrkanalige Sprachverbesserung", Universität Hamburg, Oct. 2021
- Hannes Neuschmidt: "Phase-Aware Speech Enhancement with Neural Networks", Universität Hamburg, May 2021
- Viktor Skiba: "Aktuelle Methoden zur Quellentrennung von Musik im Kontext der Signal Separation Evaluation Campaign", Universität Hamburg, Apr. 2020
- Theodor Wulff: "Classification of Background Noise in Movie Sound", Universität Hamburg, Apr. 2020
- Mihai-Ali Popa: "Deep Learning Methods for Score-to-Audio Music Generation", Universität Hamburg, Nov. 2019
- Stephanie Bramlage: "Acoustic Emotion Recognition in Group Discussions with Neural Networks", Universität Hamburg, Jul. 2019
- Nils-Hendrik Mohrmann: "Reduzierung des Eigengeräusches für eine robuste automatische Spracherkennung mit Robotern", Universität Hamburg, Jun. 2019
- Yannik Höfgen: "Analysis of Machine Learning Supported Acoustic Beamforming", Universität Hamburg, Feb. 2019
- Walther Stieben: "Echtzeit Beamforming auf einem eingebetteten System - Implementierung und Analyse", Universität Hamburg, Dec. 2018
- Leroy Bartel: "Machine Learning Based Phase Estimation for Speech Enhancement", Universität Hamburg, Nov. 2018
- Thomas Walther: "Multi-channel speech command recognition for humanoid robots", Universität Hamburg, Sep. 2018
- Nils Matze Heine: "Real-Time Automatic Gain Control für Singing Voice Applications", Universität Hamburg, May 2018
Completed Master Theses
- Alexander Schiessl: Complex-Valued Deep Neural Networks for Speech Enhancement, Apr 2025
- Haruka Inoba: Compensation for Imbalanced In-the-wild Emotional Speech Dataset for Emotional Speech Synthesis, Jan 2025
- Marlon Kramer: "Cross-modal random network-based fusion for audio-visual continuous emotion recognition", Universität Hamburg, June 2024
- Diana Rueda: "Modeling paralinguistic mimicry for emotion recognition in social interactions using attention", Universität Hamburg, Sept. 2023
- Roland Fredenhagen: "Sound Source Localization using the Azure Kinect's Microphone Array on a Robot", Universität Hamburg, July 2023
- David Zadim: "Incorporating Model Uncertainties in Deep Iterative Projections for Ptychography", Universität Hamburg, July 2023
- Julian Rettelbach: "Complex logarithm input features for phase-aware deep speech", Universität Hamburg, June 2023
- Walter Stieben: "Single-Channel Speech Enhancement with Graph-Convolutional Neural Networks", Universität Hamburg, May 2023
- Jan Zickermann: "Multi-Channel Speech Enhancement with Multi-Stage Deep Neural Networks", Universität Hamburg, Jan. 2023
- Niklas Wittmer: "Multichannel Joint Reduction of Ego-Noise and Environmental Noise with Variational Autoencoders and Non-Negative Matrix Factorization", Universität Hamburg, Nov. 2022
- Julian Tobergte: "Statistically Motivated Multi-frame Extension for Neural Network-based Single-channel Speech Dereverberation", Universität Hamburg, Nov. 2022
- Paul Hoelzen: "On self-supervised audio-visual deep learning approaches for sound spatialization", Universität Hamburg, August 2022
- Tanja Flemming: "Representation Learning for Sound Source Localization", Universität Hamburg, June 2022
- Stephanie Bramlage: "Fatigue Prediction for Offshore Wind Turbine Support Structures With the help of Neural Networks and SCADA Data Signal Processing", Universität Hamburg, Apr. 2022
- Simon Welker: "Deep Neural Networks for Phase Retrieval in Diffractive Imaging and Speech Processing", Universität Hamburg, Dec. 2021
- Nils-Hendrik Mohrmann: "Comparative Study of Deep Neural Networks for Multichannel Speech Enhancement", Universität Hamburg, Nov. 2021
- Jeanine Liebold: "Audio-Visual Phoneme Recognition with Deep Neural Networks", Universität Hamburg, Nov. 2021
- Jose Alberto Rodriguez Parra Flores: "Multiframe TasNet for Single Channel Speech Enhancement", Universität Hamburg, Apr. 2021
- Leroy Bartel: "Deep Learning based Speaker Count Estimation for Single-Channel Speech Separation", Universität Hamburg, Mar. 2021
- Marc Siemering: "Real-Time Speech Separation with Deep Attractor Networks on an Embedded System", Universität Hamburg, Nov. 2020
- Klaus-Johan Ziegert: "Phasenrekonstruktion Transienter Sprache", Universität Hamburg, Nov. 2020
- Kristina Tesch: "On the Role of Non-Linear Filtering in Multichannel Speech Enhancement", Universität Hamburg, Mar. 2019
- Praveen Baburao Kulkarni: "Deep Learning Techniques for Side Channel Analysis", Universität Hamburg, Aug. 2018
- Konstantin Kobs: "Voice Conversion Using Modern Machine Learning Techniques", Universität Hamburg, Nov. 2017
Proposal 1: Audio-Visual Signal Processing
Speech processing algorithms are an integral part of many technical devices that are ubiquitous in the lives of many people, such as smartphones (e.g. speech recognition and telephony) or hearing aids.
Most of these algorithms rely on audio data alone, i.e. microphone recordings. However, in very noisy situations, e.g. in a crowded restaurant or on a busy street, the desired speech signal at the microphone can be severely corrupted by undesired noise, and the performance of speech processing algorithms may drop significantly. In such acoustically challenging situations, humans are known to also utilize visual cues, most prominently lip reading, to still comprehend what has been said. So far, this additional source of information is neglected by the vast majority of mainstream approaches.
In this thesis, we will explore ways to improve speech processing by also utilizing visual information, especially when the audio signal is severely distorted. The major question in this context is how the information obtained from the two modalities can be combined. While there are many sophisticated lip reading algorithms as well as audio-only speech processing algorithms, there are only a few methods that use information from one modality to benefit the processing of the other. In this thesis, you will first implement and thoroughly evaluate one or more reference approaches, identifying their strengths and weaknesses. Based on this analysis, we will strive for novel ways to improve the performance of these approaches.
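To make the fusion question concrete, the sketch below shows one possible late-fusion strategy in PyTorch; the module, feature dimensions, and mask-based enhancement target are illustrative assumptions, not a specific reference approach.

```python
import torch
import torch.nn as nn

class LateFusionAVModel(nn.Module):
    """Toy example: concatenate audio and lip-video embeddings, then predict a spectral mask."""
    def __init__(self, audio_dim=257, video_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, audio_dim),
            nn.Sigmoid(),  # mask in [0, 1] applied to the noisy magnitude spectrogram
        )

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, frames, audio_dim), video_feats: (batch, frames, video_dim);
        # assumes the video stream has already been resampled to the audio frame rate
        a, _ = self.audio_enc(audio_feats)
        v, _ = self.video_enc(video_feats)
        return self.mask_head(torch.cat([a, v], dim=-1))
```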
Basic knowledge of signal processing as well as programming skills, preferably Python or Matlab, are a definite plus.
Contact: Danilo de Oliveira, Prof. Timo Gerkmann
Proposal 2: Phase-Aware Speech Enhancement
Speech understanding in noise degrades greatly with decreasing signal-to-noise ratio. This effect is even more severe for hearing-impaired people, hearing aid users, and cochlear implant users. Noise reduction algorithms aim at reducing the noise to facilitate speech communication. This is particularly difficult when the noise signal is highly non-stationary.
Many speech enhancement algorithms are based on a representation of the noisy speech signal in the short-time Fourier transform (STFT) domain. In this domain, speech is represented by the STFT magnitude and the STFT phase. In the last two decades, research on speech enhancement algorithms mainly focused on improving the STFT magnitude, while the STFT phase was left unchanged. More recently, however, researchers have pointed out that phase processing may improve speech enhancement algorithms further.
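For illustration, a minimal Python sketch of this classical magnitude-centric pipeline is given below; the STFT parameters and the identity gain function are placeholder assumptions, not a particular method from the literature.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_magnitude_only(noisy, fs=16000, gain_fn=lambda mag: mag):
    """Classical pipeline: modify only the STFT magnitude and reuse the noisy phase."""
    _, _, X = stft(noisy, fs=fs, nperseg=512, noverlap=384)
    magnitude, phase = np.abs(X), np.angle(X)
    enhanced_mag = gain_fn(magnitude)            # placeholder for any magnitude estimator
    X_hat = enhanced_mag * np.exp(1j * phase)    # the noisy phase is left untouched
    _, enhanced = istft(X_hat, fs=fs, nperseg=512, noverlap=384)
    return enhanced
```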
This thesis aims at implementing, analyzing and developing algorithms for phase processing. The overall goal is to obtain intelligible, high quality speech signals even from heavily distorted recordings in real-time. Of special interest in this context are the robust estimation of the clean speech phase from the noise-corrupted recording, the interplay between the traditional enhancement of spectral amplitudes and the recent developments in spectral phase enhancement, as well as the potential of modern machine learning techniques like deep learning.
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Starting from this, new concepts to push the limits of the current state of the art will be developed and realized. Finally, the derived algorithm(s) will be evaluated and compared to existing approaches by means of instrumental measures and listening experiments. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Experience and basic knowledge of signal processing are definitely helpful but not mandatory.
Contact: Tal Peer, Prof. Timo Gerkmann (UHH)
Proposal 3: Sound Source Localization and Tracking
Humans are remarkably skilled at localizing sound sources in complex acoustic environments. This ability is largely due to spatial cues, such as time and level differences, that our brain can interpret. In technical systems, microphone arrays enable similar spatial processing, allowing algorithms to estimate the positions of speakers and track them over time. Accurate and low-latency speaker tracking is a key component in many downstream applications, such as steering spatially selective filters (SSFs) that enhance a specific target speaker in multi-speaker scenarios.
Traditional localization methods rely on statistical models and techniques such as time difference of arrival (TDOA) estimation, steered response power (SRP), or subspace-based methods like MUSIC. While these techniques can be effective, they are often based on oversimplified statistical assumptions and struggle under real-world conditions. On the other hand, data-driven approaches can learn complex mappings between spatial features and speaker positions, thereby improving robustness in challenging noisy and reverberant environments. However, since they are usually very resource-intensive, their practical applicability is limited. Hybrid methods address this issue by incorporating lightweight neural networks (NNs) into traditional estimation frameworks to refine the underlying statistical models while retaining a small computational overhead.
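As a concrete example of such a classical building block, below is a textbook-style GCC-PHAT estimator for the time difference of arrival between two microphones; the function name and parameters are illustrative assumptions, not code from a specific method developed in our group.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs=16000, max_tau=None):
    """Estimate the time difference of arrival between two microphone signals
    using the generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau) if max_tau else n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # TDOA in seconds
```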
In this thesis, we will investigate low-complexity speaker localization and tracking algorithms that can reliably provide target direction estimates in real-time. These estimates are intended to guide deep SSFs in order to continuously enhance a target speaker in dynamic multi-speaker scenarios. Instead of completely replacing classical spatial filtering with deep learning, the focus will be on developing hybrid approaches as a lightweight front-end. The aim is to explore the trade-off between localization accuracy, robustness, and algorithmic complexity in realistic acoustic settings.
Basic knowledge of statistical signal processing and machine learning, as well as programming experience in Python, are required for this thesis. Familiarity with machine learning frameworks such as PyTorch is an advantage but not strictly necessary.
Contact: Jakob Kienegger, Alina Mannanova, Prof. Timo Gerkmann
Proposal 4: Multichannel Speech Enhancement
Humans have impressive capabilities to filter out background noise when focusing on a specific target speaker. This is largely due to the fact that humans have two ears that allow for spatial processing of the received sound. Many classical multichannel speech enhancement algorithms are also based on this approach. So-called beamformers (e.g. the delay-and-sum beamformer or the minimum variance distortionless response beamformer) enhance a speech signal by emphasizing the signal from the desired direction and attenuating signals from other directions.
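As a point of reference, a minimal frequency-domain delay-and-sum beamformer might look as follows; the array geometry, sign conventions, and variable names are illustrative assumptions rather than a fixed recipe.

```python
import numpy as np

def delay_and_sum(stfts, mic_positions, look_direction, freqs, c=343.0):
    """Frequency-domain delay-and-sum beamformer (far-field, plane-wave assumption).
    stfts:          (mics, freq_bins, frames) complex STFTs of the microphone signals
    mic_positions:  (mics, 3) microphone coordinates in meters
    look_direction: (3,) unit vector pointing from the array towards the target speaker
    freqs:          (freq_bins,) frequency axis in Hz"""
    delays = mic_positions @ look_direction / c                          # per-microphone delays in seconds
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])    # (mics, freq_bins)
    # time-align all channels to the look direction and average them
    return np.mean(np.conj(steering)[:, :, None] * stfts, axis=0)
```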
In the field of single-channel speech enhancement, the use of machine learning techniques and in particular deep neural networks (DNNs) led to significant improvements. These networks can be trained to learn arbitrarily complex nonlinear functions. As such, it seems possible to improve the multichannel speech enhancement performance by using DNNs to learn spatial filters that are not restricted to a linear processing model as beamformers are.
In this thesis, we will investigate the potential of DNNs for multichannel speech enhancement. As a first step, existing DNN-based multichannel speech enhancement algorithms will be implemented and evaluated. Of particular importance is the question under which circumstances the ML-based approaches outperform existing methods. The gained insights and experience can then be used to improve the existing DNN-based algorithms.
Basic knowledge of signal processing and machine learning, as well as programming experience in any language (preferably Python), are required for this thesis. Experience with a machine learning toolbox (e.g. PyTorch or TensorFlow) is helpful but not mandatory.
Contact: Alina Mannanova, Jakob Kienegger, Prof. Timo Gerkmann
Proposal 6: Audio-Visual Emotion Recognition (Social Signal Processing)
Overview
Emerging technologies such as virtual assistants (e.g. Siri and Alexa) and collaboration tools (e.g. Zoom and Discord) have enriched a large part of our lives. Traditionally, these technologies have focused mainly on understanding user commands and supporting their intended tasks. More recently, however, researchers in the fields of psychology and machine learning have been pushing towards the development of socially intelligent systems capable of understanding the emotional states of users as expressed by non-verbal cues. This interdisciplinary research field combining psychology and machine learning is termed Social Signal Processing [1].
Emotions are complex states of feeling, positive or negative, that result in physical and psychological changes that influence our behavior. Such emotional states are communicated by an individual through both verbal and non-verbal behavior. Emotional cues abound across modalities: audio prosody, speech semantic content, facial reactions, and body language. Empirical findings in the literature reveal that different dimensions of emotion are best expressed by certain modalities. For example, the arousal dimension (the activation-deactivation dimension of emotion) is better recognized from the audio modality than from the video modality, whereas the valence dimension (the pleasure-displeasure dimension of emotion) is better recognized from the video modality.
A review of emotion recognition techniques can be found in [2]. Recent work from our lab on speech emotion recognition can be found in [3].
Objectives
Recent literature reveals progress towards multimodal approaches for emotion recognition that jointly and effectively model different dimensions of emotion. However, a multimodal approach poses several challenges, such as:
- usage of appropriate modality fusion strategies,
- automatic alignment of different modalities,
- machine learning architectures for the respective modalities, and
- disentanglement or joint-training strategies for respective emotion dimensions.
In this project, we will explore multimodal approaches for emotion recognition by aptly fusing different modalities and jointly modeling different dimensions of emotion.
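One of many conceivable fusion strategies is sketched below: a cross-modal attention block in PyTorch in which audio features attend to video features, followed by separate arousal and valence heads. The class name, dimensions, and pooling choice are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Toy illustration of one possible fusion strategy: the audio stream attends
    to the video stream, and separate heads regress arousal and valence."""
    def __init__(self, audio_dim=40, video_dim=128, d_model=64, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.arousal_head = nn.Linear(d_model, 1)
        self.valence_head = nn.Linear(d_model, 1)

    def forward(self, audio, video):
        # audio: (batch, audio_frames, audio_dim), video: (batch, video_frames, video_dim)
        q = self.audio_proj(audio)
        kv = self.video_proj(video)
        fused, _ = self.cross_attn(q, kv, kv)    # audio queries attend to video keys/values
        pooled = fused.mean(dim=1)               # average over time
        return self.arousal_head(pooled), self.valence_head(pooled)
```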
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Building on this, novel concepts to push the limitations of the current state-of-the-art will be developed and realized. Finally, the developed algorithm(s) will be evaluated and compared to existing approaches by means of quantitative and qualitative analysis of the results. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or journal.
Skills
- Programming experience (any language, preferably Python or Matlab)
- Mastery of a machine learning framework (e.g. PyTorch, TensorFlow)
- Experience and basic knowledge of signal processing are definitely helpful but not mandatory
References
- A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey of an emerging domain. Image and vision computing, 27(12):1743–1759, 2009.
- B. W. Schuller, “Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, pp. 90–99, Apr. 2018.
- N. Raj Prabhu, G. Carbajal, N. Lehmann-Willenbrock, and T. Gerkmann, "End-to-end label uncertainty modeling for speech emotion recognition using Bayesian neural networks," arXiv preprint arXiv:2110.03299, 2021. https://arxiv.org/abs/2110.03299
Contact
Proposal 5: Speech Dereverberation for Hearing Aids
Overview
Reverberation is an acoustic phenomenon that occurs when a sound signal is reflected by walls and obstacles in a specular or diffuse way. As a result, the reverberated sound is the output of a convolutive filter whose impulse response depends on the geometry of the room and the materials of the different reflectors. Although reverberation is a linear process, it is difficult to extract the original anechoic sound from its reverberated version in a blind scenario, that is, without knowing the so-called room impulse response. Reverberation is perceived as natural in most scenarios by unimpaired listeners; however, strong reverberation dramatically degrades speech processing performance for hearing-impaired listeners (speaker identification and separation, speech enhancement, ...).
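To make the convolutive model above concrete, a minimal Python sketch is given below; the room impulse response would come from measurements or a room simulator, and the normalization is purely illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry_speech, rir):
    """Reverberation as a linear convolution of the dry signal with a room impulse response (RIR)."""
    wet = fftconvolve(dry_speech, rir)[: len(dry_speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)   # simple peak normalization for listening

# Blind dereverberation is the inverse problem: recover dry_speech from the
# reverberant signal without access to the RIR.
```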
Objectives
The objective of this thesis is to design AI-based dereverberation algorithms for the hearing industry: in particular, these algorithms should be real-time capable and robust to conditions encountered in real-life situations such as heavy reverberation, non-stationary noise, multiple sources, and other interferences. They should also be lightweight enough to run on embedded devices such as cochlear implants or hearing aids.
Skills
Required skills
- Basic knowledge of signal processing
- Mathematical background in probabilistic theory and machine learning
- Programming experience (any language, preferably Python or Matlab)
Useful skills
- Theoretical background in room acoustics
- Mastery of a machine learning framework (e.g. PyTorch, TensorFlow)
Contact
Proposal 7: Sequence Modeling for Speech and Language Processing
Speech processing tasks have benefited greatly from various sequence modeling techniques, from recurrent neural networks (LSTMs, GRUs) and temporal convolutional networks (TCNs) to attention-based models (Transformers). Each technique has its own characteristics and computational advantages and disadvantages. Speech signals contain complex structures at different levels of abstraction, from acoustic to linguistic and semantic information, and from local to global information. One (or more) of these aspects might be more relevant for a given speech-related task and might be modeled better by specific methods.
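For a feel of how such architectures can be swapped against each other, a small PyTorch sketch is given below; the helper function, layer sizes, and the single dilated convolution standing in for a full TCN are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_backbone(kind: str, dim: int = 128) -> nn.Module:
    """Interchangeable sequence models over (batch, frames, dim) feature sequences."""
    if kind == "lstm":
        return nn.LSTM(dim, dim, num_layers=2, batch_first=True)
    if kind == "transformer":
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=2)
    if kind == "tcn":
        # single dilated 1-D convolution as a stand-in for a full TCN;
        # note: Conv1d expects (batch, dim, frames), so transpose before and after use
        return nn.Conv1d(dim, dim, kernel_size=3, dilation=2, padding=2)
    raise ValueError(f"unknown backbone: {kind}")

x = torch.randn(8, 200, 128)                 # e.g. 8 utterances, 200 frames, 128 features
seq_out, _ = build_backbone("lstm")(x)       # (8, 200, 128)
attn_out = build_backbone("transformer")(x)  # (8, 200, 128)
```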
In this thesis, we would like to investigate how different architectures handle the various types of information contained in speech signals and potentially use our findings to better and more efficiently tackle various speech-related tasks, especially speech enhancement, separation and emotion recognition. In this context, large pre-trained speech models such as HuBERT and WavLM can also be explored.
Basic knowledge of machine learning and experience with Python programming are required. Knowledge of signal processing and/or experience with Python deep learning libraries is a plus.
Contact: Danilo de Oliveira, Prof. Timo Gerkmann