[1] HERSHEY J R, CHEN Z, LE ROUX J, et al. Deep clustering: Discriminative embeddings for segmentation and separation [C]//2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai: IEEE, 2016: 31-35.
[2] CHEN Z, LUO Y, MESGARANI N. Deep attractor network for single-microphone speaker separation [C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 246-250.
[3] YU D, KOLBÆK M, TAN Z H, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation [C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 241-245.
[4] KOLBÆK M, YU D, TAN Z H, et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(10): 1901-1913.
[5] LUO Y, MESGARANI N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation [C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 696-700.
[6] LUO Y, MESGARANI N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266.
[7] LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation [C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 46-50.
[8] GE M, XU C L, WANG L B, et al. Multi-stage speaker extraction with utterance and frame-level reference signals [C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6109-6113.
[9] DELCROIX M, ZMOLIKOVA K, OCHIAI T, et al. Speaker activity driven neural speech extraction [C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6099-6103.
[10] WANG Q, MUCKENHIRN H, WILSON K, et al. VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking [C]//Interspeech 2019. Graz: ISCA, 2019: 2728-2732.
[11] LI T L, LIN Q J, BAO Y Y, et al. Atss-Net: Target speaker separation via attention-based neural network [C]//Interspeech 2020. Shanghai: ISCA, 2020: 1411-1415.
[12] CHEN J, RAO W, WANG Z L, et al. MC-SpEx: Towards effective speaker extraction with multi-scale interfusion and conditional speaker modulation [C]//Interspeech 2023. Dublin: ISCA, 2023: 4034-4038.
[13] WANG Q, DOWNEY C, WAN L, et al. Speaker diarization with LSTM [C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5239-5243.
[14] WANG W Q, QIN X Y, LI M. Cross-channel attention-based target speaker voice activity detection: Experimental results for the M2MeT challenge [C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 9171-9175.
[15] YU F, ZHANG S, FU Y, et al. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge [C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 6167-6171.
[16] DING S J, WANG Q, CHANG S Y, et al. Personal VAD: Speaker-conditioned voice activity detection [C]//The Speaker and Language Recognition Workshop (Odyssey 2020). Tokyo: ISCA, 2020: 433-439.
[17] GE M, XU C L, WANG L B, et al. SpEx+: A complete time domain speaker extraction network [C]//Interspeech 2020. Shanghai: ISCA, 2020: 1406-1410.
[18] WANG W Q, LI M, LIN Q J. Online target speaker voice activity detection for speaker diarization [C]//Interspeech 2022. Incheon: ISCA, 2022: 1441-1445.
[19] LIN Q J, YIN R Q, LI M, et al. LSTM based similarity measurement with spectral clustering for speaker diarization [C]//Interspeech 2019. Graz: ISCA, 2019: 366-370.
[20] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[21] DENG J K, GUO J, XUE N N, et al. ArcFace: Additive angular margin loss for deep face recognition [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4685-4694.
[22] COSENTINO J, PARIENTE M, CORNELL S, et al. LibriMix: An open-source dataset for generalizable speech separation [DB/OL]. (2020-05-22). http://arxiv.org/abs/2005.11262
[23] LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR–half-baked or well done? [C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 626-630.
[24] WANG W Q, CAI D W, LIN Q J, et al. The DKU-DukeECE-lenovo system for the diarization task of the 2021 VoxCeleb speaker recognition challenge [DB/OL]. (2021-09-05). http://arxiv.org/abs/2109.02002
[25] YU F, DU Z H, ZHANG S L, et al. A comparative study on speaker-attributed automatic speech recognition in multi-party meetings [C]//Interspeech 2022. Incheon: ISCA, 2022: 560-564.
[26] DELCROIX M, ZMOLIKOVA K, KINOSHITA K, et al. Single channel target speaker extraction and recognition with speaker beam [C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5554-5558.
[27] XU C L, RAO W, CHNG E S, et al. Optimization of speaker extraction neural network with magnitude and temporal spectrum approximation loss [C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 6990-6994.
[28] XU C L, RAO W, CHNG E S, et al. Time-domain speaker extraction network [C]//2019 IEEE Automatic Speech Recognition and Understanding Workshop. Singapore: IEEE, 2019: 327-334.
[29] XU C L, RAO W, CHNG E S, et al. SpEx: Multi-scale time domain speaker extraction network [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1370-1384.
[30] YAO Z Y, WU D, WANG X, et al. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit [C]//Interspeech 2021. Brno: ISCA, 2021: 4054-4058.
[31] ZHANG B B, LV H, GUO P C, et al. WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition [C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 6182-6186.