J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 258-264. doi: 10.1007/s12204-024-2739-7

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios


ZENG Bang1,2, SUO Hongbin3, WAN Yulong3, LI Ming1,2

1. School of Computer Science, Wuhan University, Wuhan 430027, China; 2. Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Duke Kunshan University, Kunshan 215316, Jiangsu, China; 3. Data & AI Engineering System, OPPO, Beijing 100125, China
• Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-05-06

Abstract: Conventional target speech separation directly estimates the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech, rather than merely producing an optimal estimate of a single target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS). By exploiting target speaker voice activity detection (TSVAD) and the estimated masks, the SD-MTSS system can extract the speech of each speaker concurrently from a conversational recording without requiring enrollment audio in advance. Experimental results show that our MTSS model improves on the baseline by 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ) on the WSJ0-2mix-extr dataset, respectively. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
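A minimal PyTorch sketch of the joint-masking idea described above: a shared encoder predicts one mask per speaker, so all sources are estimated simultaneously in a single pass, and SI-SDR (one of the reported metrics) scores the time-domain estimates. The architecture, layer sizes, and names here are illustrative assumptions, not the authors' MTSS model.

import torch
import torch.nn as nn

class MTSSSketch(nn.Module):
    """Jointly predict one magnitude mask per speaker from a shared encoder
    (an illustrative stand-in for the MTSS idea, not the paper's architecture)."""
    def __init__(self, n_bins: int = 257, hidden: int = 256, n_speakers: int = 2):
        super().__init__()
        self.n_speakers = n_speakers
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, n_bins * n_speakers)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, n_bins) magnitude spectrogram of the mixture
        h, _ = self.rnn(mix_mag)
        masks = torch.sigmoid(self.mask_head(h))                     # (B, T, F*K)
        masks = masks.view(*mix_mag.shape[:2], -1, self.n_speakers)  # (B, T, F, K)
        # One masked copy of the mixture per speaker, produced concurrently.
        return masks * mix_mag.unsqueeze(-1)

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for time-domain signals of shape (..., samples);
    zero-mean preprocessing is omitted for brevity."""
    ref_energy = (ref * ref).sum(dim=-1, keepdim=True) + eps
    proj = ((est * ref).sum(dim=-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    ratio = (proj * proj).sum(dim=-1) / ((noise * noise).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

mix = torch.rand(1, 100, 257)   # toy mixture magnitude spectrogram
est = MTSSSketch()(mix)         # (1, 100, 257, 2): one estimate per speaker

In the SD-aware variant described in the abstract, diarization (TSVAD) determines which speakers are active in each region of the recording, so per-speaker masks can be assigned without any enrollment audio.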

Key words: target speech separation, interrelationship, speaker diarization (SD), target speaker voice activity detection (TSVAD), multiple-target speech separation (MTSS) model

