Teaching
Courses
- LECTURES / VORLESUNGEN
Statistical Signal Processing | M.Sc. Winter Semester
Speech Signal Processing / Sprachsignalverarbeitung | M.Sc. Summer Semester
Foundations of Data Analytics (FDA) | M.Sc. Winter Semester
Digital Media Signal Processing / Digitale Mediensignalverarbeitung | B.Sc. Summer Semester
Mathematik der Informatik für Studierende des Lehramts | B.Sc. Summer Semester
- SEMINARS / SEMINARE
Data Science | M.Sc. Summer Semester
Aktuelle Themen der Audiosignalverarbeitung | B.Sc. Summer Semester
- LAB COURSES / PRAKTIKA
Praktikum Audiosignalverarbeitung | B.Sc. Winter Semester
- PROJECTS / PROJEKTE, for example
Deep Learning for Audio Processing + integriertes Seminar | B.Sc. Winter Semester
Deep Learning for Multimodal Data Science + integriertes Seminar | M.Sc. Winter Semester
B.Sc. and M.Sc. Topic Proposals
Here you can find the topics our students are currently working on. If you are interested in one of these topics, please reach out to the listed contact person. If you have your own idea for a thesis topic in the field of signal processing, please contact us and we will see how we can develop your idea together.
Proposal 1: Multi-modal Signal Processing (Audio, Video, Text)
Speech processing algorithms are an integral part of many technical devices that are ubiquitous in the lives of many people, such as smartphones (e.g. speech recognition and telephony) or hearing aids.
Most of these algorithms rely on audio data alone, i.e. microphone recordings. In very noisy situations, e.g. in a crowded restaurant or on a busy street, however, the desired speech signal at the microphone can be severely corrupted by undesired noise, and the performance of speech processing algorithms may drop significantly. In such acoustically challenging situations, humans are known to also utilize visual cues, most prominently lip reading, to still comprehend what has been said. So far, this additional source of information is neglected by the vast majority of mainstream approaches. Additionally, knowledge of linguistic structure can be leveraged to fill in gaps in the corrupted signal, ideally without having to resort to complex pipelines with automatic speech recognition and text-based large language models (LLMs).
In this thesis, we will explore ways to improve speech processing by also utilizing visual and linguistic information, especially when the audio signal is severely distorted. The major question in this context is how the information obtained from the different modalities can be combined. While there are many sophisticated lip reading algorithms, audio-only speech processing algorithms, and a multitude of text LLMs, only a few methods use information from one modality to benefit the processing of another. In this thesis, you will first implement and thoroughly evaluate one or more reference approaches, identifying their strengths and weaknesses. Based on this analysis, we will strive for novel ways to improve the performance of these approaches.
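To make the fusion question concrete, the following minimal PyTorch sketch combines per-frame audio features with projected lip-reading features and predicts a time-frequency mask. All dimensions, module names, and the late-fusion strategy are illustrative assumptions, not a reference implementation from the literature.

```python
# Hypothetical late-fusion block for audio-visual speech enhancement (sketch).
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=256, video_dim=512, hidden_dim=256):
        super().__init__()
        # project lip-reading features to the audio feature size (assumed dimensions)
        self.video_proj = nn.Linear(video_dim, audio_dim)
        self.fusion = nn.GRU(2 * audio_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Linear(hidden_dim, audio_dim)  # e.g. a spectral mask estimate

    def forward(self, audio_feat, video_feat):
        # audio_feat: (batch, frames, audio_dim); video_feat: (batch, frames, video_dim)
        # assumes both streams are already aligned to a common frame rate
        v = self.video_proj(video_feat)
        fused, _ = self.fusion(torch.cat([audio_feat, v], dim=-1))
        return torch.sigmoid(self.mask_head(fused))

mask = AudioVisualFusion()(torch.randn(1, 100, 256), torch.randn(1, 100, 512))
print(mask.shape)  # torch.Size([1, 100, 256])
```

How and where to fuse the modalities (early, late, or via cross-attention) is exactly the kind of design choice this thesis will investigate.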
Basic knowledge of signal processing as well as programming skills, preferably Python, are a definite plus.
Contact: Danilo de Oliveira, Prof. Timo Gerkmann
Proposal 2: Phase-Aware Speech Enhancement
The ease of speech understanding in noise degrades greatly with decreasing signal-to-noise ratio. This effect is even more severe for hearing-impaired people, hearing aid users, and cochlear implant users. Noise reduction algorithms aim at reducing the noise to facilitate speech communication. This is particularly difficult when the noise signal is highly time-varying.
Many speech enhancement algorithms are based on a representation of the noisy speech signal in the short-time Fourier transform (STFT) domain. In this domain, speech is represented by the STFT magnitude and the STFT phase. In the last two decades, research on speech enhancement algorithms has mainly focused on improving the STFT magnitude, while the STFT phase was left unchanged. More recently, however, researchers have pointed out that phase processing may improve speech enhancement algorithms further.
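The following minimal NumPy/SciPy sketch illustrates this split: the complex STFT is separated into magnitude and phase, only the magnitude is modified, and the signal is resynthesized with the unchanged noisy phase. The input signal and the spectral gain are placeholders for illustration only.

```python
# Sketch of the classical "enhance the magnitude, keep the noisy phase" pipeline.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)                 # stand-in for one second of noisy speech

_, _, X = stft(noisy, fs=fs, nperseg=512)   # complex STFT
magnitude, phase = np.abs(X), np.angle(X)   # X = magnitude * exp(1j * phase)

gain = 0.8 * np.ones_like(magnitude)        # placeholder for an estimated spectral gain
enhanced_magnitude = gain * magnitude

# classical reconstruction: enhanced magnitude combined with the unchanged noisy phase;
# phase-aware methods additionally estimate or modify the phase term
X_hat = enhanced_magnitude * np.exp(1j * phase)
_, enhanced = istft(X_hat, fs=fs, nperseg=512)
```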
This thesis aims at implementing, analyzing and developing algorithms for phase processing. The overall goal is to obtain intelligible, high quality speech signals even from heavily distorted recordings in real-time. Of special interest in this context are the robust estimation of the clean speech phase from the noise-corrupted recording, the interplay between the traditional enhancement of spectral amplitudes and the recent developments in spectral phase enhancement, as well as the potential of modern machine learning techniques like deep learning.
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Starting from this, new concepts to push the limits of the current state of the art will be developed and realized. Finally, the derived algorithm(s) will be evaluated and compared to existing approaches by means of instrumental measures and listening experiments. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Experience and basic knowledge of signal processing are definitely helpful but not mandatory.
Contact: Tal Peer, Prof. Timo Gerkmann (UHH)
Proposal 3: Sound Source Localization and Tracking
Humans are remarkably skilled at localizing sound sources in complex acoustic environments. This ability is largely due to spatial cues, such as time and level differences, that our brain can interpret. In technical systems, microphone arrays enable similar spatial processing, allowing algorithms to estimate the positions of speakers and track them over time. Accurate and low-latency speaker tracking is a key component in many downstream applications, such as steering spatially selective filters (SSFs) that enhance a specific target speaker in multi-speaker scenarios.
Traditional localization methods rely on statistical models and techniques such as time difference of arrival (TDOA) estimation, steered response power (SRP), or subspace-based methods like MUSIC. While these techniques can be effective, they are often based on oversimplified statistical assumptions and struggle under real-world conditions. On the other hand, data-driven approaches can learn complex mappings between spatial features and speaker positions, thereby improving robustness in challenging noisy and reverberant environments. However, since they are usually very resource intensive, their practical applicability is limited. Hybrid methods address this issue by incorporating lightweight neural networks (NNs) into traditional estimation frameworks to refine the underlying statistical models while retaining a small computational overhead.
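As a concrete example of such a traditional building block, the sketch below estimates the TDOA between one microphone pair via GCC-PHAT. The synthetic signals, the chosen delay, and the search range are placeholder assumptions used purely for illustration.

```python
# Minimal GCC-PHAT TDOA estimator for a single microphone pair (sketch).
import numpy as np

def gcc_phat(x1, x2, fs, max_tau):
    """Estimate the time difference of arrival between x1 and x2 in seconds."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)   # PHAT weighting
    max_shift = min(int(fs * max_tau), n // 2)                # restrict to plausible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs, delay = 16000, 8                                   # 8 samples = 0.5 ms delay
s = np.random.randn(fs)
x1 = s
x2 = np.concatenate((np.zeros(delay), s[:-delay]))     # x2 is a delayed copy of x1
print(gcc_phat(x1, x2, fs, max_tau=1e-3) * fs)         # approximately -8 samples
```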
In this thesis, we will investigate low-complexity speaker localization and tracking algorithms that can reliably provide target direction estimates in real-time. These estimates are intended to guide deep SSFs in order to continuously enhance a target speaker in dynamic multi-speaker scenarios. Instead of completely replacing classical spatial filtering with deep learning, the focus will be on developing hybrid approaches as a lightweight front-end. The aim is to explore the trade-off between localization accuracy, robustness, and algorithmic complexity in realistic acoustic settings.
Basic knowledge of statistical signal processing and machine learning as well as programming experience in Python are required for this thesis. Familiarity with machine learning frameworks such as PyTorch is an advantage but not strictly necessary.
Contact: Jakob Kienegger, Alina Mannanova, Prof. Timo Gerkmann
Proposal 4: Multichannel Speech Enhancement
Humans have impressive capabilities to filter out background noise when focusing on a specific target speaker. This is largely due to the fact that humans have two ears, which allow for spatial processing of the received sound. Many classical multichannel speech enhancement algorithms are based on the same idea. So-called beamformers (e.g. the delay-and-sum beamformer or the minimum variance distortionless response (MVDR) beamformer) enhance a speech signal by emphasizing signals from the desired direction and attenuating signals from other directions.
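For illustration, the following NumPy sketch computes MVDR weights w = R^-1 d / (d^H R^-1 d) for a single frequency bin and checks the distortionless constraint towards the target direction. The steering vector and noise covariance are random placeholders rather than quantities estimated from real recordings.

```python
# Minimal MVDR beamformer for one frequency bin (sketch with placeholder inputs).
import numpy as np

def mvdr_weights(steering, noise_cov):
    """w = R^-1 d / (d^H R^-1 d) for one frequency bin."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

n_mics = 4
d = np.exp(-2j * np.pi * np.random.rand(n_mics))          # placeholder steering vector
A = np.random.randn(n_mics, n_mics) + 1j * np.random.randn(n_mics, n_mics)
R = A @ A.conj().T + np.eye(n_mics)                        # Hermitian placeholder noise covariance

w = mvdr_weights(d, R)
x = np.random.randn(n_mics) + 1j * np.random.randn(n_mics) # one STFT bin of the mic signals
y = w.conj() @ x                                           # beamformer output for this bin
print(np.allclose(w.conj() @ d, 1.0))                      # distortionless constraint holds
```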
In the field of single-channel speech enhancement, the use of machine learning techniques and in particular deep neural networks (DNNs) led to significant improvements. These networks can be trained to learn arbitrarily complex nonlinear functions. As such, it seems possible to improve the multichannel speech enhancement performance by using DNNs to learn spatial filters that are not restricted to a linear processing model as beamformers are.
In this thesis, we will investigate the potential of DNNs for multichannel speech enhancement. As a first step, existing DNN-based multichannel speech enhancement algorithms will be implemented and evaluated. Of particular importance is the question under which circumstances the ML-based approaches outperform existing methods. The gained insights and experience can then be used to improve the existing DNN-based algorithms.
Basic knowledge of signal processing and machine learning as well as programming experience in any language (most preferably Python) are required for this thesis. Experience with a machine learning toolbox (e.g. PyTorch or TensorFlow) is helpful but not mandatory.
Contact: Alina Mannanova, Jakob Kienegger, Lennart Uphaus, Prof. Timo Gerkmann
Proposal 5: Speech Dereverberation for Hearing Aids
Overview
Reverberation is an acoustic phenomenon that occurs when a sound signal is reflected by walls and obstacles in a specular or diffuse way. The resulting reverberant sound can be modeled as the output of a convolutive filter whose impulse response depends on the geometry of the room and the materials of the different reflectors. Although reverberation is a linear process, it is difficult to recover the original anechoic sound from its reverberant version in a blind scenario, that is, without knowing the so-called room impulse response (RIR). Unimpaired listeners perceive reverberation as natural in most scenarios; for hearing-impaired listeners, however, strong reverberation dramatically degrades speech processing performance (speaker identification and separation, speech enhancement, ...).
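The convolutive model described above can be written as x = s * h (plus noise), where h is the room impulse response. The following minimal NumPy sketch builds a crude synthetic RIR with an exponentially decaying tail purely for illustration; real RIRs would be measured or simulated (e.g. with the image method).

```python
# Sketch of the reverberation model: reverberant = anechoic convolved with an RIR.
import numpy as np

fs = 16000
anechoic = np.random.randn(fs)                     # stand-in for a dry speech signal

t = np.arange(int(0.5 * fs)) / fs                  # 0.5 s impulse response
rir = np.random.randn(len(t)) * np.exp(-t / 0.1)   # decaying diffuse tail (placeholder)
rir[0] = 1.0                                       # direct path

reverberant = np.convolve(anechoic, rir)           # linear process, yet hard to invert blindly
```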
Objectives
The objective of this thesis is to design AI-based dereverberation algorithms for the hearing industry. In particular, these algorithms should be real-time capable and robust to conditions encountered in real-life situations such as heavy reverberation, non-stationary noise, multiple sources, and other interferences. They should also be lightweight enough to run on embedded devices such as cochlear implants or hearing aids.
Skills
Required skills
- Basic knowledge of signal processing
- Mathematical background in probability theory and machine learning
- Programming experience (any language, preferably Python or Matlab)
Useful skills
- Theoretical background in room acoustics
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
Contact
Proposal 6: Audio-Visual Emotion Recognition (Social Signal Processing)
Overview
Emerging technologies such as virtual assistants (e.g. Siri and Alexa) and collaboration tools (e.g. Zoom and Discord) have enriched a large part of our lives. Traditionally, these technologies have focused mainly on understanding user commands and supporting the intended tasks. More recently, however, researchers in the fields of psychology and machine learning have been pushing towards the development of socially intelligent systems capable of understanding the emotional states of users as expressed by non-verbal cues. This interdisciplinary research field combining psychology and machine learning is termed Social Signal Processing [1].
Emotions are complex states of feeling, positive or negative, that result in physical and psychological changes which influence our behavior. Such emotional states are communicated by an individual through both verbal and non-verbal behavior. Emotional cues abound across modalities: audio prosody, semantic speech content, facial reactions, and body language. Empirical findings in the literature reveal that different dimensions of emotion are best expressed by certain modalities. For example, the arousal dimension (the activation-deactivation dimension of emotion) is better recognized from the audio modality than from the video modality, while the valence dimension (the pleasure-displeasure dimension of emotion) is best recognized from the video modality.
A review of emotion recognition techniques can be found in [2]. Recent work from our lab on speech emotion recognition can be found in [3].
Objectives
Recent literature shows progress towards multimodal approaches for emotion recognition that jointly and effectively model different dimensions of emotion. However, several challenges exist in a multimodal approach, such as:
- usage of appropriate modality fusion strategies,
- automatic alignment of different modalities,
- machine learning architectures for the respective modalities, and
- disentanglement or joint-training strategies for the respective emotion dimensions.
In this project, we will explore multimodal approaches for emotion recognition, by aptly fusing different modalities and jointly modeling different dimensions of emotion.
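As a starting point, a simple late-fusion baseline might look like the following PyTorch sketch, which encodes audio and video embeddings separately and jointly regresses arousal and valence. All dimensions, names, and the fusion strategy are illustrative assumptions; more sophisticated fusion (e.g. cross-modal attention) and alignment strategies are part of the project work.

```python
# Hypothetical late-fusion baseline for joint arousal/valence prediction (sketch).
import torch
import torch.nn as nn

class LateFusionEmotion(nn.Module):
    def __init__(self, audio_dim=128, video_dim=256, hidden_dim=128):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(2 * hidden_dim, 2)   # joint regression of [arousal, valence]

    def forward(self, audio_emb, video_emb):
        fused = torch.cat([self.audio_enc(audio_emb), self.video_enc(video_emb)], dim=-1)
        return self.head(fused)                    # (batch, 2)

preds = LateFusionEmotion()(torch.randn(8, 128), torch.randn(8, 256))
print(preds.shape)  # torch.Size([8, 2])
```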
First, existing approaches will be implemented and analyzed with respect to their strengths and weaknesses. Building on this, novel concepts to push the limits of the current state of the art will be developed and realized. Finally, the developed algorithm(s) will be evaluated and compared to existing approaches by means of quantitative and qualitative analysis of the results. Depending on the outcome of this work, we encourage and strive for a publication of the results at a scientific conference or in a journal.
Skills
- Programming experience (any language, preferably Python or Matlab)
- Mastery of a machine learning framework, e.g. PyTorch or TensorFlow
- Experience and basic knowledge of signal processing are definitely helpful but not mandatory
References
[1] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image and Vision Computing, vol. 27, no. 12, pp. 1743–1759, 2009.
[2] B. W. Schuller, "Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, Apr. 2018.
[3] N. Raj Prabhu, G. Carbajal, N. Lehmann-Willenbrock, and T. Gerkmann, "End-to-end label uncertainty modeling for speech emotion recognition using Bayesian neural networks," arXiv preprint arXiv:2110.03299, 2021. https://arxiv.org/abs/2110.03299
Contact
Proposal 7: Real-Time Diffusion-based Speech Enhancement
Diffusion models represent a leading class of generative methods and have recently advanced the state of the art in speech enhancement. The Signal Processing Group has contributed several publicly recognised systems, including SGMSE+, STORM, and BUDDY.
Despite their excellent perceptual quality, current diffusion‑based approaches are unsuitable for online operation because they require many neural network evaluations per time frame.
The objective of this project is to develop and evaluate techniques that enable real-time diffusion-based speech enhancement. Potential avenues include the Diffusion Buffer, knowledge distillation, and latent-space diffusion.
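To make the computational bottleneck concrete, the sketch below runs a generic reverse-diffusion loop and counts the network function evaluations (NFE) required for a single segment. The score model, step rule, and tensor shapes are placeholders and deliberately not the SGMSE+/STORM/BUDDY recipes; reducing this NFE count is the core of the project.

```python
# Sketch: each reverse-diffusion step costs one score-network forward pass.
import torch

score_model = lambda x, t: -x            # placeholder for a trained score network
num_steps = 30
dt = 1.0 / num_steps

x = torch.randn(1, 256, 100)             # start from noise (freq bins x frames, illustrative shapes)
nfe = 0                                  # number of network function evaluations
for n in range(num_steps, 0, -1):
    t = torch.tensor(n * dt)
    x = x + score_model(x, t) * dt       # simplistic Euler-style reverse update
    nfe += 1

print(nfe)  # 30 forward passes for this segment; real-time methods must reduce this drastically
```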
The proposed algorithms will be implemented and demonstrated in our laboratory on a laptop GPU. We also aim for further optimization towards laptop-CPU- or smartphone-class hardware.
Please be familiar with Python and PyTorch; basic knowledge of digital signal processing, Bayesian statistics and deep learning is necessary, but passion and curiosity are most important.
Contact: Rostislav Makarov, Bunlong Lay, Prof. Timo Gerkmann