Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain
Update: The website for the extended journal paper, which reports improved performance, can be found here.
Result comparison
The tables below provide example audio files for our proposed speech enhancement method, SGMSE [1], alongside the baseline methods DiffuSE [2] and CDiffuSE [3].
Mixed outputs
Following the practice of DiffuSE [2] and CDiffuSE [3], the table below compares the clean and noisy audio with the mixed estimate x̂m = 0.8x̂ + 0.2y, where x̂ is the raw estimate from each method and y is the noisy audio. The suffix m in the column headers marks these mixed outputs:
Filename | Clean | Noisy | SGMSEm (proposed) [1] | DiffuSEm [2] | CDiffuSEm [3] |
---|---|---|---|---|---|
p257_028.wav | |||||
p257_087.wav | |||||
p257_189.wav | |||||
p257_388.wav | |||||
p232_035.wav | |||||
p232_050.wav | |||||
p232_186.wav | |||||
p232_314.wav | |||||
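
The mixing step behind this table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `mix_with_noisy`, the parameter `alpha`, and the toy sample values are hypothetical, and only the weights 0.8 and 0.2 come from the formula given for x̂m.

```python
import numpy as np


def mix_with_noisy(x_hat: np.ndarray, y: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Return alpha * x_hat + (1 - alpha) * y for two equally shaped signals.

    x_hat: raw enhanced estimate, y: noisy input (both time-domain samples).
    alpha = 0.8 reproduces the 0.8 / 0.2 weighting described in the text.
    """
    if x_hat.shape != y.shape:
        raise ValueError("x_hat and y must have the same shape")
    return alpha * x_hat + (1.0 - alpha) * y


# Toy values standing in for real waveforms (hypothetical, for illustration only).
x_hat = np.array([0.5, -0.5, 0.25])
y = np.array([1.0, 0.0, -0.75])
x_hat_m = mix_with_noisy(x_hat, y)
```

In practice, `x_hat` and `y` would be the sample arrays of an enhanced file and its noisy counterpart, loaded at the same sampling rate and trimmed to the same length before mixing.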
Raw outputs
The table below compares the clean and noisy audio against the raw enhanced estimate x̂ from each method:
Filename | Clean | Noisy | SGMSE (proposed) [1] | DiffuSE [2] | CDiffuSE [3] |
---|---|---|---|---|---|
p257_028.wav | |||||
p257_087.wav | |||||
p257_189.wav | |||||
p257_388.wav | |||||
p232_035.wav | |||||
p232_050.wav | |||||
p232_186.wav | |||||
p232_314.wav | |||||
References
[1] S. Welker, J. Richter, and T. Gerkmann, “Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain,” arXiv preprint arXiv:2203.17004 [cs, eess], Mar 2022. http://arxiv.org/abs/2203.17004
[2] Y.-J. Lu, Y. Tsao, and S. Watanabe, “A Study on Speech Enhancement Based on Diffusion Probabilistic Model,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2021, pp. 659–666.
[3] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional Diffusion Probabilistic Model for Speech Enhancement,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Feb 2022.