Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
This website contains supplementary material to the papers:
- Simon Welker, Julius Richter, Timo Gerkmann, Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain, ISCA Interspeech, Incheon, Korea, Sep. 2022. [1]
- Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann, Speech Enhancement and Dereverberation with Diffusion-Based Generative Models, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351 - 2364, 2023. [2]
Code
The code is available at https://github.com/sp-uhh/sgmse
Denoising examples
Full method comparison
The methods written in black (✓ in the "Matched?" column) represent the matched condition, i.e., the model was trained and tested on WSJ0-CHiME3. The grayed-out methods (✗) represent the mismatched condition, i.e., the model was trained on VoiceBank-DEMAND and tested on WSJ0-CHiME3.
Female speakers:
Method | Matched? | 441c0204.wav | 445o030y.wav | 444o030c.wav | 445o0308.wav |
---|---|---|---|---|---|
Input SNR | | 0.41 dB | 2.56 dB | 4.46 dB | 9.97 dB |
Clean | |||||
Noisy | |||||
SGMSE+ [2] | ✓ | ||||
SGMSE+ [2] | ✗ | ||||
Conv-TasNet [3] | ✓ | ||||
Conv-TasNet [3] | ✗ | ||||
MetricGAN+ [4] | ✓ | ||||
MetricGAN+ [4] | ✗ | ||||
SGMSE [1] | ✓ | ||||
SGMSE [1] | ✗ | ||||
CDiffuSE [5] | ✓ | ||||
CDiffuSE [5] | ✗ | ||||
STCN [6] | ✓ | ||||
STCN [6] | ✗ | ||||
RVAE [7] | ✓ | ||||
RVAE [7] | ✗ |
Male speakers:
Method | Matched? | 440o0304.wav | 447c0201.wav | 447o030t.wav | 440c020v.wav |
---|---|---|---|---|---|
Input SNR | | 0.28 dB | 2.54 dB | 4.27 dB | 6.68 dB |
Clean | |||||
Noisy | |||||
SGMSE+ [2] | ✓ | ||||
SGMSE+ [2] | ✗ | ||||
Conv-TasNet [3] | ✓ | ||||
Conv-TasNet [3] | ✗ | ||||
MetricGAN+ [4] | ✓ | ||||
MetricGAN+ [4] | ✗ | ||||
SGMSE [1] | ✓ | ||||
SGMSE [1] | ✗ | ||||
CDiffuSE [5] | ✓ | ||||
CDiffuSE [5] | ✗ | ||||
STCN [6] | ✓ | ||||
STCN [6] | ✗ | ||||
RVAE [7] | ✓ | ||||
RVAE [7] | ✗ |
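The "Input SNR" row in the tables above gives the signal-to-noise ratio of each noisy input. As a rough illustration, the sketch below computes an SNR in dB from a clean reference and its noisy mixture using the common energy-ratio definition. This definition is an assumption: the papers may compute the value differently (e.g., with other weighting or segmenting), and the helper name `input_snr_db` and the synthetic signals are hypothetical stand-ins, not taken from the sgmse code or the WSJ0-CHiME3 data.

```python
import math
import random


def input_snr_db(clean, noisy):
    """Return the SNR in dB of `noisy` relative to the reference `clean`.

    Assumes noisy = clean + noise, with both sequences time-aligned
    and of equal length.
    """
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy must have the same length")
    signal_energy = sum(s * s for s in clean)
    noise_energy = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)


# Synthetic stand-in signals: a 220 Hz tone plus white Gaussian noise,
# one second at a 16 kHz sampling rate.
random.seed(0)
sample_rate = 16000
clean = [math.sin(2.0 * math.pi * 220.0 * n / sample_rate)
         for n in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in clean]

# Scale the noise so that the mixture has a target SNR of 5 dB.
target_snr_db = 5.0
clean_energy = sum(s * s for s in clean)
noise_energy = sum(v * v for v in noise)
scale = math.sqrt(clean_energy / (noise_energy * 10.0 ** (target_snr_db / 10.0)))
noisy = [s + scale * v for s, v in zip(clean, noise)]

print(f"Input SNR: {input_snr_db(clean, noisy):.2f} dB")
```

With real recordings, `clean` and `noisy` would be the time-aligned samples of the clean and noisy files of an utterance.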
Dereverberation examples
Real data
Speech Enhancement
Examples are taken from the DNS challenge 2020 test set [8].

File name | Noisy | SGMSE+ (WSJ0+CHiME3) | SGMSE+ (VB-DMD) |
---|---|---|---|
audioset_realrec_airconditioner_9akKYWm_f9E.wav | |||
audioset_realrec_airconditioner_DMZAnHsY8e8.wav | |||
audioset_realrec_babycry_3oS1DK_35z4.wav | |||
audioset_realrec_barking_Cp9Vfz2viUw.wav | |||
audioset_realrec_car_00fs8Gpipss.wav | |||
audioset_realrec_car_1B0WiVPQ7ro.wav | |||
audioset_realrec_clatter_7KUIyRIW4gM.wav | |||
audioset_realrec_printer_1wh_xYxrwg8.wav | |||
audioset_realrec_printer_2BIihAdg5TQ.wav | |||
ms_realrec_headset_cafe_spk1_3.wav | |||
ms_realrec_headset_Headphone-cafeteria-spk3-1.wav | |||
ms_realrec_headset_roger_cafeteria_1.wav | |||
ms_realrec_speakerphone_Ebrahim_B31Kitchen_S11_SurfaceBook.wav | |||
Dereverberation
Examples are taken from the MC-WSJ-AV test set [9].

File name | Reverberant | SGMSE+ |
---|---|---|
AMI_WSJ22-Array1-1_T22c0302.wav | ||
AMI_WSJ22-Array1-1_T22c0308.wav | ||
AMI_WSJ23-Array1-1_T23c0309.wav | ||
AMI_WSJ23-Array1-1_T23c0314.wav | ||
AMI_WSJ27-Array1-1_T37c030w.wav | ||
AMI_WSJ28-Array1-1_T38c030v.wav | ||
AMI_WSJ28-Array1-1_T38c0301.wav | ||
AMI_WSJ29-Array1-1_T39c030t.wav | ||
AMI_WSJ29-Array1-1_T39c030w.wav | ||
AMI_WSJ29-Array1-1_T39c030x.wav | ||
AMI_WSJ30-Array1-1_T40c0309.wav |
References
[1] Simon Welker, Julius Richter, and Timo Gerkmann. Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain, ISCA Interspeech, Incheon, Korea, Sep. 2022.
[2] Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, and Timo Gerkmann. Speech Enhancement and Dereverberation with Diffusion-Based Generative Models, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351 - 2364, 2023.
[3] Yi Luo and Nima Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[4] Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, and Yu Tsao. MetricGAN+: An improved version of MetricGAN for speech enhancement, arXiv preprint arXiv:2104.03538, 2021.
[5] Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, and Yu Tsao. Conditional diffusion probabilistic model for speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[6] Julius Richter, Guillaume Carbajal, and Timo Gerkmann. Speech Enhancement with Stochastic Temporal Convolutional Networks, ISCA Interspeech, 2020.
[7] Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, and Laurent Girin. Unsupervised speech enhancement using dynamical variational auto-encoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2993 - 3007, 2022.
[8] Chandan Reddy et al. The Interspeech 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results, ISCA Interspeech, 2020.
[9] Mike Lincoln et al. The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.