
Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230026, China; 2. China Merchants Bank, Shenzhen 518048, Guangdong, China

Received date: 2023-12-19

Accepted date: 2024-01-05

Online published: 2024-04-03

Abstract

This paper describes a speaker-attributed automatic speech recognition (SA-ASR) system submitted to the multi-channel multi-party meeting transcription challenge, which aims to address the “who spoke what” problem. We align serialized output training (SOT) based multi-speaker ASR hypotheses with speaker diarization (SD) results to obtain speaker-attributed transcriptions. A pre-trained multi-frame cross-channel attention (MFCCA) model serves as the ASR module, and a cascade of a pre-trained speaker overlap-aware neural diarization model and a target-speaker voice activity detection model serves as the SD module. Decoding and alignment strategies are further used to improve SA-ASR performance. The proposed system outperforms the baseline with a relative improvement of 40.3% in terms of concatenated minimum-permutation character error rate on the AliMeeting dataset, and ranks in the top three on the fixed sub-track.
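The core alignment idea described above can be sketched in a few lines: attribute each multi-speaker ASR hypothesis segment to whichever speaker's diarization segments overlap it most in time. This is a minimal illustration of the general overlap-based alignment strategy, not the authors' exact algorithm; the function names and data structures here are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of temporal overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_speakers(asr_segments, sd_segments):
    """Assign each ASR segment the speaker whose SD segments overlap it most.

    asr_segments: list of (start, end, text) tuples from the ASR module
    sd_segments:  list of (start, end, speaker_id) tuples from the SD module
    Returns a list of (start, end, speaker_id, text) tuples.
    """
    attributed = []
    for a_start, a_end, text in asr_segments:
        # Accumulate total overlap per speaker across all SD segments.
        totals = {}
        for s_start, s_end, spk in sd_segments:
            totals[spk] = totals.get(spk, 0.0) + overlap(a_start, a_end, s_start, s_end)
        if totals and max(totals.values()) > 0:
            best = max(totals, key=totals.get)
        else:
            best = "unknown"  # no diarization coverage for this segment
        attributed.append((a_start, a_end, best, text))
    return attributed
```

In practice, SOT hypotheses interleave multiple speakers within one utterance, so real systems align at a finer granularity (e.g., token timestamps), but the maximal-overlap principle is the same.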

Cite this article

Xu Luzhen, Yan Haoyin, He Maokui, Guo Zixian, Zhou Yeping, Liu Peiqi, Zhang Jie, Dai Lirong. Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription[J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(2): 298-304. DOI: 10.1007/s12204-024-2715-2

