Journal of Shanghai Jiaotong University (Science)
DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition
Received date: 2023-12-19
Accepted date: 2024-01-05
Online published: 2024-04-22
CHEN Chengxin, ZHANG Pengyuan. DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition [J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(2): 248-257. DOI: 10.1007/s12204-024-2724-1
1. SCHULLER B W. Speech emotion recognition [J]. Communications of the ACM, 2018, 61(5): 90-99.
2. WENINGER F, EYBEN F, SCHULLER B W, et al. On the acoustics of emotion in audio: What speech, music, and sound have in common [J]. Frontiers in Psychology, 2013, 4: 292.
3. EYBEN F, SCHERER K R, SCHULLER B W, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing [J]. IEEE Transactions on Affective Computing, 2016, 7(2): 190-202.
4. WAGNER J, TRIANTAFYLLOPOULOS A, WIERSTORF H, et al. Dawn of the transformer era in speech emotion recognition: Closing the valence gap [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10745-10759.
5. YE J X, WEN X C, WEI Y J, et al. Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition [C]// 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
6. SHEN S Y, LIU F, ZHOU A M. Mingling or misalignment? Temporal shift for speech emotion recognition with pre-trained representations [C]// 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
7. CHEN C X, ZHANG P Y. CTA-RNN: Channel and temporal-wise attention RNN leveraging pre-trained ASR embeddings for speech emotion recognition [C]//Interspeech 2022. Incheon: ISCA, 2022: 4730-4734.
8. ABDELWAHAB M, BUSSO C. Domain adversarial for acoustic emotion recognition [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(12): 2423-2435.
9. LUO H, HAN J Q. Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 2047-2060.
10. ZHANG S Q, LIU R X, YANG Y J, et al. Unsupervised domain adaptation integrating transformer and mutual information for cross-corpus speech emotion recognition [C]// 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 120-129.
11. SETHU V, AMBIKAIRAJAH E, EPPS J. Speaker normalisation for speech-based emotion detection [C]//2007 15th International Conference on Digital Signal Processing. Cardiff: IEEE, 2007: 611-614.
12. BUSSO C, METALLINOU A, NARAYANAN S S. Iterative feature normalization for emotional speech detection [C]//2011 IEEE International Conference on Acoustics, Speech and Signal Processing. Prague: IEEE, 2011: 5692-5695.
13. DANG T, SETHU V, AMBIKAIRAJAH E. Factor analysis based speaker normalisation for continuous emotion prediction [C]//Interspeech 2016. San Francisco: ISCA, 2016: 913-917.
14. GORROSTIETA C, LOTFIAN R, TAYLOR K, et al. Gender de-biasing in speech emotion recognition [C]//Interspeech 2019. Graz: ISCA, 2019: 2823-2827.
15. LI H Q, TU M, HUANG J, et al. Speaker-invariant affective representation learning via adversarial training [C]// 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 7144-7148.
16. YIN Y F, HUANG B Y, WU Y Z, et al. Speaker-invariant adversarial domain adaptation for emotion recognition [C]// 2020 International Conference on Multimodal Interaction. Online: ACM, 2020: 481-490.
17. GAT I, ARONOWITZ H, ZHU W Z, et al. Speaker normalization for self-supervised speech emotion recognition [C]// 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7342-7346.
18. TRIANTAFYLLOPOULOS A, LIU S, SCHULLER B W. Deep speaker conditioning for speech emotion recognition [C]//2021 IEEE International Conference on Multimedia and Expo. Shenzhen: IEEE, 2021: 1-6.
19. FAN W Q, XU X M, CAI B L, et al. ISNet: Individual standardization network for speech emotion recognition [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 1803-1814.
20. WANG X, CHEN H, TANG S A, et al. Disentangled representation learning [DB/OL]. (2022-11-21). https://arxiv.org/abs/2211.11695
21. KINGMA D P, WELLING M. Auto-encoding variational Bayes [DB/OL]. (2013-12-20). http://arxiv.org/abs/1312.6114
22. GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// 28th Conference on Neural Information Processing Systems. Montreal: NIPS, 2014: 2672-2680.
23. HSU W N, ZHANG Y, WEISS R J, et al. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization [C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 5901-5905.
24. LIAN J C, ZHANG C L, YU D. Robust disentangled variational speech representation learning for zero-shot voice conversion [C]// 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 6572-6576.
25. TRINH V A, BRAUN S. Unsupervised speech enhancement with speech recognition embedding and disentanglement losses [C]// 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 391-395.
26. CHICCO D. Siamese neural networks: An overview [M]// Artificial neural networks. New York: Humana Press, 2021: 73-94.
27. DEY S, DUTTA A, TOLEDO J I, et al. SigNet: Convolutional Siamese network for writer independent offline signature verification [DB/OL]. (2017-07-07). http://arxiv.org/abs/1707.02131
28. CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations [C]// 37th International Conference on Machine Learning. Vienna: IMLS, 2020: 1597-1607.
29. LIAN Z, LI Y, TAO J H, et al. Speech emotion recognition via contrastive loss under Siamese networks [C]// Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data. Seoul: ACM, 2018: 21-26.
30. HAJAVI A, ETEMAD A. Siamese capsule network for end-to-end speaker recognition in the wild [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 7203-7207.
31. MIRSAMADI S, BARSOUM E, ZHANG C. Automatic speech emotion recognition using recurrent neural networks with local attention [C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 2227-2231.
32. BOUSMALIS K, TRIGEORGIS G, SILBERMAN N, et al. Domain separation networks [C]// 30th Conference on Neural Information Processing Systems. Barcelona: NIPS, 2016: 1-9.
33. BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: Interactive emotional dyadic motion capture database [J]. Language Resources and Evaluation, 2008, 42(4): 335-359.
34. BUSSO C, PARTHASARATHY S, BURMANIA A, et al. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception [J]. IEEE Transactions on Affective Computing, 2017, 8(1): 67-80.
35. LU C, ZONG Y, ZHENG W M, et al. Domain invariant feature learning for speaker-independent speech emotion recognition [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2217-2230.
36. VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE [J]. Journal of Machine Learning Research, 2008, 9(11): 2579-2605.