Journal of Shanghai Jiao Tong University (Science) ›› 2020, Vol. 25 ›› Issue (1): 70-75.doi: 10.1007/s12204-019-2147-6


Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder

ZHU Tao (朱涛), CHENG Chunling* (程春玲)

  1. (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
  • Online: 2020-01-15  Published: 2020-01-12
  • Corresponding author: CHENG Chunling (程春玲) E-mail: chengcl@njupt.edu.cn


Abstract: The traditional speech recognition model based on a deep neural network (DNN) and a hidden Markov model (HMM) is a complex, multi-module system, and the optimization goals may differ between modules. Moreover, it requires additional language resources, such as a pronunciation dictionary and a language model. To eliminate these drawbacks, we propose an end-to-end speech recognition method in which connectionist temporal classification (CTC) and attention are integrated for decoding. In our model, the complex modules are replaced by a single deep network consisting mainly of an encoder and a decoder. The encoder is built from bidirectional long short-term memory (BLSTM) layers with a triangular structure for feature extraction. The decoder, based on joint CTC-attention decoding, uses the high-level features extracted by the shared encoder for both training and decoding. Experimental results on the VoxForge dataset indicate that the end-to-end method is superior to both basic CTC and attention-based encoder-decoder decoding, reducing the character error rate (CER) to 12.9% without using any language model.
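To make the joint objective concrete, the following is a minimal pure-Python sketch (not the paper's implementation) of the standard CTC forward algorithm and of a weighted CTC-attention loss of the form λ·L_CTC + (1−λ)·L_att. The function names, the toy probabilities, and the interpolation weight λ are illustrative assumptions; a real system would compute both terms from the shared encoder's outputs in log space.

```python
import math

def ctc_prob(probs, labels, blank=0):
    """Forward (alpha) pass of CTC: probability that the per-frame
    distributions `probs` (T x V) collapse to `labels` after removing
    blanks and repeated symbols."""
    # Extended label sequence with blanks between and around the labels.
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    # Initialization: a path may start with a blank or the first label.
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on ext[s]
            if s > 0:
                a += alpha[t - 1][s - 1]             # advance by one
            # Skip transition allowed between distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid paths end on the last label or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def joint_loss(ctc_p, att_p, lam=0.2):
    """Multi-task objective: lam * CTC loss + (1 - lam) * attention loss."""
    return -lam * math.log(ctc_p) - (1.0 - lam) * math.log(att_p)

# Toy example: 2 frames, vocabulary {blank, 'a'}, uniform frame posteriors.
# Paths collapsing to "a": "aa", "a-", "-a", each with probability 0.25.
probs = [[0.5, 0.5], [0.5, 0.5]]
p_ctc = ctc_prob(probs, [1])  # -> 0.75
```

In training, λ balances the two branches: CTC enforces a monotonic frame-to-label alignment, while the attention decoder models label dependencies; the sketch above only shows how the two losses are interpolated, not the encoder itself.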

Key words: end-to-end, connectionist temporal classification (CTC), attention, speech recognition

