RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation

Speech Samples

Five models were evaluated on YouTube videos from the VoxCeleb2 dataset [1]:

AV-ConvTasNet [2]
VisualVoice [3]
AVLIT [4]
CTCNet [5]
RTFS-Net

Below are examples of interactive multimodal speech separation. Hover the mouse over a speaker's
lips to hear that speaker's separated audio.

Demo 1

AV-ConvTasNet VisualVoice AVLIT CTCNet RTFS-Net

Demo 2

AV-ConvTasNet VisualVoice AVLIT CTCNet RTFS-Net

Demo 3

AV-ConvTasNet VisualVoice AVLIT CTCNet RTFS-Net

Demo 4

AV-ConvTasNet VisualVoice AVLIT CTCNet RTFS-Net

References

[1] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep speaker recognition. In Interspeech, 2018.
[2] Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, and Dong Yu. Time domain audio visual speech separation. In ASRU, pp. 667–673, 2019.
[3] Ruohan Gao and Kristen Grauman. VisualVoice: Audio-visual speech separation with cross-modal consistency. In CVPR, pp. 15490–15500. IEEE, 2021.
[4] Hector Martel, Julius Richter, Kai Li, Xiaolin Hu, and Timo Gerkmann. Audio-visual speech separation in noisy environments with a lightweight iterative model. In Interspeech, pp. 1673–1677, 2023.
[5] Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, and Xiaolin Hu. An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits. arXiv preprint arXiv:2212.10744, 2022.