J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 258-264. doi: 10.1007/s12204-024-2739-7

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios


ZENG Bang1,2, SUO Hongbin3, WAN Yulong3, LI Ming1,2

1. School of Computer Science, Wuhan University, Wuhan 430027, China; 2. Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Duke Kunshan University, Kunshan 215316, Jiangsu, China; 3. Data & AI Engineering System, OPPO, Beijing 100125, China
• Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-05-06

Abstract: Conventional target speech separation directly estimates the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech, rather than merely producing an optimal estimate of a single target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS). By exploiting target speaker voice activity detection (TSVAD) and the estimated masks, the SD-MTSS system can extract the speech of each speaker concurrently from a conversational recording without requiring enrollment audio in advance. Experimental results show that our MTSS model improves on the baseline by 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ) on the WSJ0-2mix-extr dataset, respectively. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
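A minimal PyTorch sketch of the joint-masking idea described above: a shared encoder predicts one mask per speaker, so all sources are estimated simultaneously in a single pass, and SI-SDR (one of the reported metrics) scores the time-domain estimates. The architecture, layer sizes, and names here are illustrative assumptions, not the authors' MTSS model.

import torch
import torch.nn as nn

class MTSSSketch(nn.Module):
    """Jointly predict one magnitude mask per speaker from a shared encoder
    (an illustrative stand-in for the MTSS idea, not the paper's architecture)."""
    def __init__(self, n_bins: int = 257, hidden: int = 256, n_speakers: int = 2):
        super().__init__()
        self.n_speakers = n_speakers
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, n_bins * n_speakers)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, n_bins) magnitude spectrogram of the mixture
        h, _ = self.rnn(mix_mag)
        masks = torch.sigmoid(self.mask_head(h))                     # (B, T, F*K)
        masks = masks.view(*mix_mag.shape[:2], -1, self.n_speakers)  # (B, T, F, K)
        # One masked copy of the mixture per speaker, produced concurrently.
        return masks * mix_mag.unsqueeze(-1)

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for time-domain signals of shape (..., samples);
    zero-mean preprocessing is omitted for brevity."""
    ref_energy = (ref * ref).sum(dim=-1, keepdim=True) + eps
    proj = ((est * ref).sum(dim=-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    ratio = (proj * proj).sum(dim=-1) / ((noise * noise).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

mix = torch.rand(1, 100, 257)   # toy mixture magnitude spectrogram
est = MTSSSketch()(mix)         # (1, 100, 257, 2): one estimate per speaker

In the SD-aware variant described in the abstract, diarization (TSVAD) determines which speakers are active in each region of the recording, so per-speaker masks can be assigned without any enrollment audio.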

Key words: target speech separation, interrelationship, speaker diarization (SD), target speaker voice activity detection (TSVAD), multiple-target speech separation (MTSS) model

