DiffPhase: Generative Diffusion-based STFT Phase Retrieval

This page accompanies the DiffPhase paper [1], presented at ICASSP 2023, with examples of speech reconstructed from known (clean) STFT magnitudes, without prior knowledge of the phase spectrum. The following eight utterances were randomly selected from the WSJ0 test set. For each example, we provide the reconstructed signals generated by:

Two variants of the proposed DiffPhase approach [1] (DiffPhase and DiffPhase-small), after 15 and 30 reverse diffusion steps
Two variants of DeGLI (Deep Griffin-Lim Iteration [2]): The original model and a larger model, both at 15 and 30 iterations
The original Griffin-Lim algorithm (GLA) [3] at 50 and 200 iterations
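
For readers unfamiliar with the GLA baseline [3], the following is a minimal NumPy sketch of the iteration, started from a zero-phase spectrogram. It is an illustration only and not the code used to produce the examples on this page: the function names (`stft`, `istft`, `griffin_lim`), the periodic Hann window, the frame length of 512 and the hop size of 128 are all assumptions made for this sketch, as is the synthetic chirp used as a test signal.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """STFT with a periodic Hann window; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann (assumed for this sketch)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = X.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop : i * hop + n_fft] += frames[i] * window
        norm[i * hop : i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)  # guard against division by zero at the edges

def griffin_lim(magnitude, n_iter=50, n_fft=512, hop=128):
    """Estimate a signal from STFT magnitudes, starting from zero phase."""
    X = magnitude.astype(complex)  # zero-phase initialization
    for _ in range(n_iter):
        x = istft(X, n_fft, hop)                   # back to the time domain
        Y = stft(x, n_fft, hop)                    # consistent spectrogram of that signal
        X = magnitude * np.exp(1j * np.angle(Y))   # keep the phase, restore the known magnitude
    return istft(X, n_fft, hop)

# Synthetic test signal (a noisy chirp), standing in for a speech utterance.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
x = np.sin(2 * np.pi * (200 + 600 * t) * t) + 0.1 * rng.standard_normal(t.size)
A = np.abs(stft(x))  # the "known clean magnitudes"

def spectral_error(x_hat):
    """Distance between the magnitudes of the estimate and the target magnitudes."""
    return np.linalg.norm(np.abs(stft(x_hat)) - A)

err_zero = spectral_error(istft(A.astype(complex)))  # zero-phase reconstruction
err_gla = spectral_error(griffin_lim(A, n_iter=50))  # 50 GLA iterations
```

In this sketch, `err_zero` plays the role of the "Zero Phase" row and `err_gla` that of the "GLA, N=50" row; passing `n_iter=200` gives the counterpart of "GLA, N=200". The magnitude error is only a rough consistency measure and does not replace listening to the examples.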

Audio examples: Phase retrieval with known clean magnitudes

	Example 1	Example 2	Example 3	Example 4	Example 5	Example 6	Example 7	Example 8
Reference
Zero Phase
DiffPhase, N=15
DiffPhase, N=30
DiffPhase-small, N=15
DiffPhase-small, N=30
DeGLI, N=15
DeGLI, N=30
DeGLI-large, N=15
DeGLI-large, N=30
GLA, N=50
GLA, N=200

References

[1] T. Peer, S. Welker, and T. Gerkmann, “DiffPhase: Generative Diffusion-based STFT Phase Retrieval,” in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Rhodes Island, Greece, Jun. 2023.

[2] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Deep Griffin–Lim Iteration: Trainable Iterative Phase Reconstruction Using Neural Network,” IEEE J. Sel. Top. Signal Process., vol. 15, no. 1, pp. 37–50, Jan. 2021.

[3] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.