Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Xu Luzhen (许露真)¹, Yan Haoyin (严浩尹)¹, He Maokui (何茂奎)¹, Guo Zixian (郭子娴)¹, Zhou Yeping (周叶萍)², Liu Peiqi (刘沛奇)², Zhang Jie (张结)¹, Dai Lirong (戴礼荣)¹

1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230026, China; 2. China Merchants Bank, Shenzhen 518048, Guangdong, China

J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue 2: 298-304. doi: 10.1007/s12204-024-2715-2

Special topic: Human-Machine Speech Communication

Received: 2023-12-19 Accepted: 2024-01-05 Online: 2026-04-01 Published: 2024-04-03

Abstract

Abstract: This paper describes the speaker-attributed automatic speech recognition (SA-ASR) system that we submitted to the Multi-Channel Multi-Party Meeting Transcription (M2MeT2.0) challenge; the system aims to address the “who spoke what” problem. We align the hypotheses of a multi-speaker ASR model based on serialized output training with speaker diarization (SD) results to obtain speaker-attributed transcriptions. A pre-trained multi-frame cross-channel attention (MFCCA) model serves as the ASR module. For the SD module, we build a cascade system that includes a pre-trained speaker overlap-aware neural diarization and target-speaker voice activity detection model. Decoding and alignment strategies are used to further improve SA-ASR performance. On the AliMeeting dataset, the proposed system outperforms the baseline with a relative improvement of 40.3% in concatenated minimum-permutation character error rate, ranking in the top three on the fixed (limited-data) sub-track.

Keywords: multi-channel multi-party meeting transcription (M2MeT2.0), speaker-attributed automatic speech recognition (SA-ASR), serialized output training, speaker diarization, concatenated minimum-permutation character error rate
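The evaluation metric named in the abstract, concatenated minimum-permutation character error rate, can be illustrated with a minimal sketch. It assumes the commonly used definition: each speaker's utterances are concatenated in time order, the character-level edit distance is summed over speakers for every possible mapping between reference and hypothesis speakers, and the smallest total is divided by the number of reference characters. The function names `edit_distance` and `cp_cer` are hypothetical, and this is not the challenge's official scoring script.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cp_cer(refs: list[str], hyps: list[str]) -> float:
    """Sketch of cpCER.

    refs[i] and hyps[j] each hold all utterances of one speaker,
    already concatenated in time order (an assumption of this sketch).
    """
    # Pad with empty strings so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs = refs + [""] * (n - len(refs))
    hyps = hyps + [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")

    # Try every speaker mapping and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_chars


if __name__ == "__main__":
    # Toy data: the hypothesis has the speakers in swapped order
    # and one wrong character, so 1 error over 6 reference characters.
    refs = ["今天开会", "好的"]
    hyps = ["好的", "今天开公"]
    print(cp_cer(refs, hyps))
```

The exhaustive search over permutations is only practical for a handful of speakers per session, which is typical of meeting data; a larger speaker count would call for an assignment algorithm instead.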

CLC number: TN912.34


Xu Luzhen, Yan Haoyin, He Maokui, Guo Zixian, Zhou Yeping, Liu Peiqi, Zhang Jie, Dai Lirong. Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(2): 298-304.

