DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition

doi:10.1007/s12204-024-2724-1

Abstract

One persistent challenge in deep-learning-based speech emotion recognition (SER) is the unintended encoding of emotion-irrelevant factors (e.g., speaker or phonetic variability), which limits how well SER models generalize in practical use. In this paper, we propose DSNet, a disentangled Siamese network with neutral calibration, to meet the demand for a more robust and explainable SER model. Specifically, we introduce an orthogonal feature disentanglement module that explicitly projects the high-level representation into two distinct subspaces. We then propose a novel neutral calibration mechanism that encourages one subspace to capture sufficient emotion-irrelevant information, so that the other subspace can better isolate and emphasize the emotion-relevant information in speech signals. Experimental results on two popular benchmark datasets demonstrate the superiority of DSNet over various state-of-the-art methods for speaker-independent SER.
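
To make the idea of projecting a representation into two orthogonal subspaces concrete, the following is a minimal sketch, not the DSNet implementation. It assumes a soft orthogonality penalty, the squared Frobenius norm of the cross-product between the two projected feature matrices, which is one common way to discourage two subspaces from sharing information; the abstract does not state which form DSNet uses. The function name, the projection matrices `w_emo` and `w_irr`, and all dimensions are hypothetical.

```python
import numpy as np


def orthogonality_penalty(z_a: np.ndarray, z_b: np.ndarray) -> float:
    """Squared Frobenius norm of z_a^T z_b for two (batch, dim) feature matrices.

    The value is zero when every column of z_a is orthogonal to every
    column of z_b across the batch, and grows as the two sets of
    features become correlated. (Assumed form, not taken from the paper.)
    """
    cross = z_a.T @ z_b  # shape: (dim_a, dim_b)
    return float(np.sum(cross ** 2))


# Hypothetical setup: 8 utterance-level features of dimension 16.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))

# Hypothetical linear projections into two 4-dimensional subspaces.
w_emo = rng.standard_normal((16, 4))  # intended emotion-relevant subspace
w_irr = rng.standard_normal((16, 4))  # intended emotion-irrelevant subspace

z_emo = h @ w_emo
z_irr = h @ w_irr

# In training, a term like this would be added to the task loss so that
# the two projections are pushed toward carrying non-overlapping information.
penalty = orthogonality_penalty(z_emo, z_irr)
print(f"orthogonality penalty: {penalty:.3f}")
```

With random projections the penalty is generally far from zero; minimizing it jointly with a classification loss is what would drive the two subspaces apart. The neutral calibration mechanism is not sketched here because the abstract does not describe how it is computed.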

Key words: speech emotion recognition, disentangled representation learning, Siamese neural network


CLC Number: TP183

Chen Chengxin, Zhang Pengyuan. DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(2): 248-257.
