Title :
Investigation of using different Chinese word segmentation standards and algorithms for automatic speech recognition
Author :
Chongjia Ni ; Cheung-Chi Leung
Author_Institution :
Inst. for Infocomm Res. (I2R), A*STAR, Singapore, Singapore
Abstract :
Chinese word segmentation (CWS) is a necessary step in Mandarin Chinese automatic speech recognition (ASR), and it affects ASR results. However, few studies have examined the relationship between CWS and ASR. Building a segmenter involves choosing CWS settings, namely a segmentation standard and a segmentation algorithm. In this paper, four CWS standards and three CWS algorithms (maximum matching, a term frequency based algorithm, and a conditional random field (CRF) based algorithm) are investigated with respect to ASR performance. Our experiments on the second Sighan Bakeoff data and on Mandarin Chinese conversational telephone speech show that better segmentation performance does not necessarily lead to better ASR performance. Maximum matching and the term frequency based algorithm, both classified as lexicon-based algorithms, make it easier to update their vocabulary inventories according to the needs of the application. We find that these two algorithms provide ASR performance similar to that of the CRF-based algorithm. Motivated by the availability of huge amounts of web text data, we investigate whether such data can improve the term frequency based algorithm and, in turn, ASR performance. Lastly, we find that combining the two lexicon-based algorithms through language model interpolation can further improve ASR performance.
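To illustrate the lexicon-based approach named in the abstract, the following is a minimal sketch of forward maximum matching. It is not the authors' implementation: the function name `forward_max_match`, the `max_len` parameter, the left-to-right matching direction, and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest substring (up to max_len characters)
    that appears in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, not taken from the paper.
toy_lexicon = {"自动", "语音", "识别", "语音识别"}
segmented = forward_max_match("自动语音识别", toy_lexicon)
print(segmented)
```

In a sketch like this, updating the vocabulary inventory amounts to adding or removing entries in `lexicon`, which is one way to read the flexibility the abstract attributes to lexicon-based algorithms.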
Keywords :
natural language processing; speech recognition; ASR performance; CRF-based algorithm; CWS algorithms; CWS settings; CWS standards; Chinese word segmentation standards; Mandarin Chinese automatic speech recognition; Mandarin Chinese conversational telephone speech; Sighan Bakeoff data; Web text data; conditional random field; language model interpolation; lexicon-based algorithms; maximum matching; segmenter; term frequency based algorithm; vocabulary inventories; Classification algorithms; Computational modeling; Data models; Speech; Standards; Training; Training data; Chinese word segmentation; Chinese word segmentation combination; automatic speech recognition;
Conference_Title :
Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on
Conference_Location :
Singapore
DOI :
10.1109/ISCSLP.2014.6936684