J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue (2): 298-304. doi: 10.1007/s12204-024-2715-2

Special topic: Human-Machine Speech Communication



Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

许露真1,严浩尹1,何茂奎1,郭子娴1,周叶萍2,刘沛奇2,张结1,戴礼荣1   

  1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230026, China; 2. China Merchants Bank, Shenzhen 518048, Guangdong, China
  • Received: 2023-12-19 Accepted: 2024-01-05 Online: 2026-04-01 Published: 2024-04-03

Abstract: This paper describes the speaker-attributed automatic speech recognition (SA-ASR) system that we submitted to the Multi-Channel Multi-Party Meeting Transcription (M2MeT2.0) challenge. The system aims to address the “who spoke what” problem. We align multi-speaker ASR hypotheses, obtained with serialized output training, with speaker diarization (SD) results to obtain speaker-attributed transcriptions. We use a pre-trained multi-frame cross-channel attention (MFCCA) model as the ASR module. For the SD module, we build a cascade system that includes a pre-trained speaker overlap-aware neural diarization model and a target-speaker voice activity detection model. Decoding and alignment strategies are used to further improve SA-ASR performance. On the AliMeeting dataset, the proposed system achieves a relative improvement of 40.3% over the baseline in terms of concatenated minimum-permutation character error rate, ranking in the top three on the fixed sub-track.

Keywords: multi-channel multi-party meeting transcription (M2MeT2.0), speaker-attributed automatic speech recognition (SA-ASR), serialized output training, speaker diarization, concatenated minimum-permutation character error rate
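The concatenated minimum-permutation character error rate (often abbreviated cpCER) named in the abstract can be illustrated with a minimal sketch. It is not the challenge's official scoring script. It assumes that each speaker's utterances have already been concatenated in time order, and that a missing speaker on either side is padded with an empty string. The function names `edit_distance` and `cp_cer` are hypothetical.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_speaker: list[str], hyp_by_speaker: list[str]) -> float:
    """Minimum, over speaker permutations, of total character errors
    divided by the total number of reference characters.

    Each list entry is one speaker's concatenated transcript.
    Assumption: unmatched speakers are padded with empty strings.
    """
    n = max(len(ref_by_speaker), len(hyp_by_speaker))
    refs = ref_by_speaker + [""] * (n - len(ref_by_speaker))
    hyps = hyp_by_speaker + [""] * (n - len(hyp_by_speaker))
    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")
    # Try every assignment of hypothesis speakers to reference speakers
    # and keep the one with the fewest character errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

The exhaustive search over permutations grows factorially with the number of speakers, which is acceptable for meetings with only a few participants. A larger speaker count would call for an assignment solver instead.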
