Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer

doi:10.1007/s12204-024-2730-3

Abstract

The lexicon is an essential component of a hybrid automatic speech recognition (ASR) system. However, a high-quality lexicon requires significant effort from linguistic experts and is difficult to obtain, especially for low-resource languages. This paper addresses the problem of using a well-trained universal phone recognizer, trained on multilingual speech data and pronunciation lexicons, to generate pronunciation lexicons for low-resource languages in a speech-data-driven manner. We propose a simple pipeline that uses this approach to generate pronunciation lexicons and applies them in ASR systems. The steps to generate the lexicon are simple and generic: apply the International Phonetic Alphabet (IPA) phone recognizer to the speech, align its output with the reference word sequence, filter the result to obtain a series of AUTO-subwords, and use these to generate the AUTO-subword lexicon and the AUTO-IPA lexicon. We use the generated pronunciation lexicons both in the hybrid system and for fine-tuning the pre-trained model. The experimental results show that the lexicon can be constructed without resorting to linguistic experts. Furthermore, the generated lexicon outperforms the grapheme-based lexicon and is comparable to the expert lexicon.

Key words: International Phonetic Alphabet (IPA), lexicon learning, phone recognition, low-resource speech recognition
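
The final stage of the pipeline described in the abstract, turning aligned phone sequences into lexicon entries, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the phone recognizer and the alignment step have already produced (word, phones) pairs, it skips the AUTO-subword stage entirely, and the filtering rule (a minimum count plus a cap on variants per word) is an assumption, since the abstract does not state the actual criterion. The function names, parameters, and toy words are all hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_tokens, min_count=2, max_variants=2):
    """Collect filtered pronunciation variants for each word.

    aligned_tokens: iterable of (word, phones) pairs, where phones is a
    sequence of IPA symbols. The pairs are assumed to come from a phone
    recognizer whose output was already aligned to the reference words.
    """
    counts = defaultdict(Counter)
    for word, phones in aligned_tokens:
        if phones:  # skip words that received no phones in the alignment
            counts[word][tuple(phones)] += 1

    lexicon = {}
    for word, variants in counts.items():
        # Assumed filter: keep the most frequent variants that were
        # observed at least min_count times.
        kept = [p for p, c in variants.most_common(max_variants) if c >= min_count]
        if kept:
            lexicon[word] = kept
    return lexicon


def format_lexicon(lexicon):
    """Render one 'word phone phone ...' entry per line, sorted by word."""
    lines = []
    for word in sorted(lexicon):
        for phones in lexicon[word]:
            lines.append(f"{word} {' '.join(phones)}")
    return lines


# Invented toy input; real pairs would come from recognition plus alignment.
toy = [
    ("kita", ("k", "i", "t", "a")),
    ("kita", ("k", "i", "t", "a")),
    ("kita", ("k", "i", "d", "a")),  # rare variant, dropped by min_count
    ("satu", ("s", "a", "t", "u")),
    ("satu", ("s", "a", "t", "u")),
    ("dua", ()),                     # no aligned phones, skipped
]
demo_lexicon = build_lexicon(toy)
demo_lines = format_lexicon(demo_lexicon)
```

Counting variants per word and discarding rare ones is one simple way to suppress recognizer and alignment noise; a real system would likely need a more careful criterion.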


CLC Number: TN912.34

Li Jinpeng, Chen Xie, Zhang Weiqiang. Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(2): 265-272.

