Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Update: The website for the extended journal paper with improved performance can be found here

Result comparison

The tables below provide example audio files for our proposed speech enhancement method, SGMSE [1], alongside the baseline methods DiffuSE [2] and CDiffuSE [3].

Mixed outputs

Following the practice of DiffuSE [2] and CDiffuSE [3], the table below compares the clean and noisy audio with the mixed estimate x̂_m = 0.8x̂ + 0.2y, where x̂ is the raw enhanced estimate from each method and y is the noisy audio:

Filename	Clean	Noisy	SGMSE_m (proposed) [1]	DiffuSE_m [2]	CDiffuSE_m [3]
p257_028.wav
p257_087.wav
p257_189.wav
p257_388.wav
p232_035.wav
p232_050.wav
p232_186.wav
p232_314.wav
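
The mixing step behind the table above can be sketched in a few lines of Python. This is only an illustration of the formula x̂_m = 0.8x̂ + 0.2y, not the code used to produce the audio files. The function name `mix_with_noisy`, the `weight` parameter, and the use of NumPy arrays for time-aligned waveforms are assumptions made for this example.

```python
import numpy as np


def mix_with_noisy(x_hat, y, weight=0.8):
    """Blend a raw enhanced estimate with the noisy input.

    Computes weight * x_hat + (1 - weight) * y, which for the default
    weight of 0.8 matches x_hat_m = 0.8 * x_hat + 0.2 * y.
    (Hypothetical helper: name and signature are illustrative only.)
    """
    x_hat = np.asarray(x_hat, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Both signals are assumed to be time-aligned and of equal length.
    if x_hat.shape != y.shape:
        raise ValueError("estimate and noisy signal must have the same shape")
    return weight * x_hat + (1.0 - weight) * y


# Toy example with two-sample "signals" instead of real audio.
mixed = mix_with_noisy([1.0, 0.0], [0.0, 1.0])
```

Mixing a fraction of the noisy input back in leaves some residual noise in the output, so the mixed and raw versions of the same method can sound noticeably different.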

Raw outputs

The table below compares the clean and noisy audio against the raw enhanced estimate x̂ from each method:

Filename	Clean	Noisy	SGMSE (proposed) [1]	DiffuSE [2]	CDiffuSE [3]
p257_028.wav
p257_087.wav
p257_189.wav
p257_388.wav
p232_035.wav
p232_050.wav
p232_186.wav
p232_314.wav

References

[1] S. Welker, J. Richter, and T. Gerkmann, “Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain,” arXiv preprint arXiv:2203.17004 [cs, eess], Mar 2022. http://arxiv.org/abs/2203.17004

[2] Y.-J. Lu, Y. Tsao, and S. Watanabe, “A study on speech enhancement based on diffusion probabilistic model,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2021, pp. 659–666.

[3] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional Diffusion Probabilistic Model for Speech Enhancement,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Feb 2022.