[1] SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: Robust DNN embeddings for speaker recognition [C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5329-5333.
[2] SNYDER D, GARCIA-ROMERO D, SELL G, et al. Speaker recognition for multi-speaker conversations using X-vectors [C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 5796-5800.
[3] ZEINALI H, WANG S, SILNOVA A, et al. BUT system description to VoxCeleb speaker recognition challenge 2019 [DB/OL]. (2019-10-16). http://arxiv.org/abs/1910.12592
[4] PEDDINTI V, POVEY D, KHUDANPUR S. A time delay neural network architecture for efficient modeling of long temporal contexts [C]//Interspeech 2015. Dresden: ISCA, 2015: 3214-3218.
[5] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[6] KENNY P. Bayesian speaker verification with heavy-tailed priors [C]//Odyssey Speaker and Language Recognition Workshop. Brno: ISCA, 2010.
[7] DEHAK N, KENNY P J, DEHAK R, et al. Front-end factor analysis for speaker verification [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(4): 788-798.
[8] VARIANI E, LEI X, MCDERMOTT E, et al. Deep neural networks for small footprint text-dependent speaker verification [C]//2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence: IEEE, 2014: 4052-4056.
[9] DESPLANQUES B, THIENPONDT J, DEMUYNCK K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification [C]//Interspeech 2020. Shanghai: ISCA, 2020: 3830-3834.
[10] ZHANG Y J, WANG Y W, CHEN C P, et al. Improving time delay neural network based speaker recognition with convolutional block and feature aggregation methods [C]//Interspeech 2021. Brno: ISCA, 2021: 76-80.
[11] HU J, SHEN L, ALBANIE S, et al. Squeeze-and-excitation networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023.
[12] WANG Q L, WU B G, ZHU P F, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531-11539.
[13] WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module [C]//Computer Vision – ECCV 2018. Cham: Springer, 2018: 3-19.
[14] ZHANG Q L, YANG Y B. SA-Net: Shuffle attention for deep convolutional neural networks [C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 2235-2239.
[15] HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design [C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 13708-13717.
[16] WEI Y H, DU J Z, LIU H, et al. CTFALite: Lightweight channel-specific temporal and frequency attention mechanism for enhancing the speaker embedding extractor [C]//Interspeech 2022. Incheon: ISCA, 2022: 341-345.
[17] CHUNG J S, NAGRANI A, ZISSERMAN A. VoxCeleb2: Deep speaker recognition [C]//Interspeech 2018. Hyderabad: ISCA, 2018: 1086-1090.
[18] SNYDER D, CHEN G G, POVEY D. MUSAN: A music, speech, and noise corpus [DB/OL]. (2015-10-28). http://arxiv.org/abs/1510.08484
[19] KO T, PEDDINTI V, POVEY D, et al. A study on data augmentation of reverberant speech for robust speech recognition [C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 5220-5224.
[20] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition [C]//Interspeech 2019. Graz: ISCA, 2019: 2613-2617.
[21] PASZKE A, GROSS S, MASSA F, et al. PyTorch: An imperative style, high-performance deep learning library [C]//33rd Conference on Neural Information Processing Systems. Vancouver: Curran Associates, 2019: 8024-8035.
[22] GAO S H, CHENG M M, ZHAO K, et al. Res2Net: A new multi-scale backbone architecture [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652-662.
[23] KINGMA D P, BA J. Adam: A method for stochastic optimization [DB/OL]. (2014-12-22). http://arxiv.org/abs/1412.6980
[24] LOSHCHILOV I, HUTTER F. SGDR: Stochastic gradient descent with warm restarts [DB/OL]. (2016-08-13). http://arxiv.org/abs/1608.03983
[25] DENG J K, GUO J, YANG J, et al. ArcFace: Additive angular margin loss for deep face recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 5962-5979.
[26] NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset [C]//Interspeech 2017. Stockholm: ISCA, 2017: 2616-2620.
[27] NAGRANI A, CHUNG J S, XIE W D, et al. VoxCeleb: Large-scale speaker verification in the wild [J]. Computer Speech & Language, 2020, 60: 101027.
[28] BROWN A, HUH J, CHUNG J S, et al. VoxSRC 2021: The third VoxCeleb speaker recognition challenge [DB/OL]. (2022-01-12). http://arxiv.org/abs/2201.04583
[29] FAN Y, KANG J W, LI L T, et al. CN-Celeb: A challenging Chinese speaker recognition dataset [C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 7604-7608.
[30] CUMANI S, BATZU P D, COLIBRO D, et al. Comparison of speaker recognition approaches for real applications [C]//Interspeech 2011. Florence: ISCA, 2011: 2365-2368.