J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue (2): 248-257. doi: 10.1007/s12204-024-2724-1

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition


Chen Chengxin (陈城鑫)1, 2, Zhang Pengyuan (张鹏远)1, 2

  1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
  2. College of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-04-22

Abstract: A persistent challenge in deep-learning-based speech emotion recognition (SER) is the unintended encoding of emotion-irrelevant factors (e.g., speaker or phonetic variability), which limits how well SER models generalize in practical use. In this paper, we propose DSNet, a disentangled Siamese network with neutral calibration, to meet the demand for a more robust and explainable SER model. Specifically, we introduce an orthogonal feature disentanglement module that explicitly projects the high-level representation into two distinct subspaces. We then propose a novel neutral calibration mechanism that encourages one subspace to capture sufficient emotion-irrelevant information, so that the other subspace can better isolate and emphasize the emotion-relevant information in speech signals. Experimental results on two popular benchmark datasets demonstrate the superiority of DSNet over various state-of-the-art methods for speaker-independent SER.

Key words: speech emotion recognition, disentangled representation learning, Siamese neural network
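
The abstract does not say how the orthogonal feature disentanglement is enforced. As a rough illustration only, the sketch below assumes one common formulation: two linear projections of a high-level representation, penalized by the squared Frobenius norm of their cross-correlation over a batch. The names `matmul`, `transpose`, `orthogonality_penalty`, `w_emo`, and `w_irr` are hypothetical and do not come from the paper. The neutral calibration mechanism is not sketched, since the abstract gives no details about it.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonality_penalty(z_emo, z_irr):
    """Squared Frobenius norm of z_emo^T z_irr.

    ASSUMPTION: this is a generic orthogonality constraint, not necessarily
    the loss used in DSNet. It is zero when every dimension of one subspace
    is orthogonal (over the batch) to every dimension of the other.
    """
    cross = matmul(transpose(z_emo), z_irr)
    return sum(v * v for row in cross for v in row)

random.seed(0)
batch, dim, sub = 6, 8, 3

# Hypothetical high-level representation and two projection matrices.
h = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
w_emo = [[random.gauss(0, 1) for _ in range(sub)] for _ in range(dim)]
w_irr = [[random.gauss(0, 1) for _ in range(sub)] for _ in range(dim)]

z_emo = matmul(h, w_emo)  # candidate emotion-relevant subspace
z_irr = matmul(h, w_irr)  # candidate emotion-irrelevant subspace
penalty_random = orthogonality_penalty(z_emo, z_irr)

# Hand-built features whose columns never overlap, so the penalty vanishes.
z_a = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
z_b = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
penalty_orth = orthogonality_penalty(z_a, z_b)
```

In a trainable model, a term of this kind would typically be added to the classification loss so that gradient descent pushes the two projections toward non-overlapping information; whether DSNet does this in exactly this form is not stated in the abstract.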

