Colloquium, Summer Semester 2025

Prof. Dr. Joseph (Yossi) Keshet

Technion, Haifa, Israel

When: 7 July 2025, 17:15

Where: Konrad-Zuse-Hörsaal (Room B-201)

Topic

From Raw Waveform to Spectrum: Practical and Theoretical Advances in Diffusion Models for Speech Generation

Language: English

Abstract

In this talk, I will present two complementary contributions that push the boundaries of diffusion models for speech generation. I will start by presenting DiffAR, an autoregressive diffusion model capable of generating high-fidelity raw speech waveforms end-to-end. By operating directly in the waveform domain and conditioning on overlapping frames, DiffAR achieves coherent, expressive, and naturally varied speech generation. Specifically, it can produce local acoustic behaviors, such as vocal fry, which make the overall waveform sound more natural.

Second, I will introduce a novel spectral analysis framework that interprets the inference process of diffusion models through a frequency-domain lens. This perspective enables principled design of noise schedules that are aligned with the spectral characteristics of the target data, replacing empirical heuristics with theoretically grounded methods.

These works were conducted in collaboration with Roi Benita and Michael Elad, and are detailed in the following papers: https://arxiv.org/abs/2310.01381, https://arxiv.org/abs/2502.00180

Bio

I am an Associate Professor at the Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering. I am the director of the Speech, Language, and Deep Learning Lab and am affiliated with the Signal and Image Processing Lab (SIPL).

I am excited about human speech. Speech is one of the most trivial yet one of the most complex signals we know. It bears information that the speaker would like to convey as well as her identity and her emotional and medical state. My research interests are driven by my passion for understanding and quantifying speech. My research concerns both machine learning and the computational study of human speech and language. My work on speech and language concentrates on speech processing, automatic speech recognition, speech synthesis, speaker recognition, automating laboratory phonology, and pathological speech. My research on machine learning focuses on core machine learning and deep learning algorithms, specifically those that capture the structure of complex tasks such as automatic speech recognition, and on how to make them reliable and trustworthy.

I received my BSc and MSc from Tel Aviv University in 1994 and 2002, respectively, and my PhD from the Hebrew University of Jerusalem in 2008. My advisor was Yoram Singer. I did my postdoc with Hynek Hermansky at IDIAP Research Institute at EPFL, and I was an Assistant Research Professor at TTI and the University of Chicago. Before moving to the Technion, I was an Associate Professor at the Department of Computer Science at Bar-Ilan University.