Enhancing Speech Recognition for Parkinson’s Disease Patient Using Transfer Learning Technique

doi:10.1007/s12204-021-2376-3

Abstract

Abstract: Patients with Parkinson’s disease often suffer from speech disorders. The most frequently reported problems are a weak, hoarse, nasal, or monotonous voice; imprecise articulation; slow or fast speech; difficulty initiating speech; impaired stress or rhythm; stuttering; and tremor. To improve speech quality and support speech rehabilitation therapy, we propose a speech recognition model for Parkinson’s disease patients using a transfer learning technique (PSTL). We first pre-train a long short-term memory (LSTM) neural network on a publicly available dataset that we collected from healthy speakers through a social media platform, and then apply transfer learning to improve the performance of the PSTL framework. Frequency spectrogram masking is used as a data augmentation method to alleviate over-fitting and further reduce the word error rate (WER). Even with a limited dataset, the proposed model reduces the WER from 58% to 44.5% on the original speech dataset and from 53.1% to 43% on the denoised speech dataset, which demonstrates the feasibility of the framework.
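For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical and chosen for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction between two error rates (same units)."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper's dataset.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER figures quoted in the abstract, in percent.
    print(relative_reduction(58.0, 44.5))
    print(relative_reduction(53.1, 43.0))
```

Applied to the figures quoted above, the drop from 58% to 44.5% corresponds to a relative WER reduction of about 23%, and the drop from 53.1% to 43% to about 19%.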

Key words: speech recognition, Parkinson’s disease, transfer learning technique, data augmentation, scarce data


YU Qing (余青), MA Yi (马祎), LI Yongfu∗ (李永福). Enhancing Speech Recognition for Parkinson’s Disease Patient Using Transfer Learning Technique[J]. J Shanghai Jiaotong Univ Sci, 2022, 27(1): 90-98.
