6 Papers at ICASSP 2026
30 January 2026, by David Mosteller
Excited to share that our team will be presenting six papers at ICASSP 2026. Congratulations and thanks to all co-authors for their excellent work.
• Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Rostislav Makarov, Lea Schönherr, Timo Gerkmann
[audio], [arxiv]
We show that advanced speech enhancement systems can be manipulated with psychoacoustically masked adversarial noise that causes semantic changes in the enhanced output. Diffusion-based models, however, show inherent robustness to such attacks.
• Are These Even Words? Quantifying the Gibberishness of Generative Speech Models
Danilo de Oliveira, Tal Peer, Jonas Rochdi, Timo Gerkmann
[audio], [arxiv]
We study how non‑intrusive metrics behave on hallucinated and gibberish speech from generative models and propose a fully unsupervised detection method based on language models. We release a high‑quality gibberish dataset and scoring tools.
• Real-Time Streaming Mel Vocoding with Generative Flow Matching
Simon Welker, Tal Peer, Timo Gerkmann
[arxiv]
We introduce MelFlow, a low‑latency flow‑matching Mel vocoder that enables real‑time streaming waveform synthesis from Mel spectrograms on consumer hardware and surpasses non‑streaming vocoders like HiFi‑GAN in speech quality.
• Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios
Jakob Kienegger, Timo Gerkmann
[audio], [arxiv]
We propose an autoregressive spatial filtering and tracking framework that handles closely spaced or crossing speakers and improves multi‑speaker extraction in dynamic conditions.
• Do We Need EMA for Diffusion-Based Speech Enhancement? Toward a Magnitude-Preserving Network Architecture
Julius Richter, Danilo de Oliveira, Timo Gerkmann
[arxiv]
We introduce EDM2SE, a Schrödinger‑bridge diffusion model with magnitude‑preserving layers and time‑dependent preconditioning, improving robustness and training stability across datasets.
• Bone-Conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models
Sina Khanagha, Bunlong Lay, Timo Gerkmann
[arxiv]
We propose a multimodal diffusion model combining bone‑conducted and air‑conducted speech, yielding strong improvements in very noisy environments.
Looking forward to seeing you in Barcelona!