Title :
Domain corpus independent vocabulary generation for embedded continuous speech recognition
Author :
Lim, Minkyu ; Kim, Kwang-Ho ; Kim, Ji-Hwan
Author_Institution :
Dept. of Comput. Sci. & Eng., Sogang Univ., Seoul, South Korea
fDate :
8/1/2009
Abstract :
This paper proposes a domain corpus independent vocabulary generation algorithm to improve vocabulary coverage for embedded continuous speech recognition (CSR). A vocabulary in CSR is normally derived from a word frequency list, so its coverage depends on the domain corpus. We present an improved method of vocabulary generation that uses a part-of-speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in vocabularies of any size. The other 51 POS tags fall mainly into noun-related, named entity (NE)-related, and verb-related POSs. We introduce a domain corpus independent word inclusion method for noun-, verb-, and NE-related POS tags using a knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. For verb-related POS tags, we categorize verbs by lemma and analyze the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least a 28.6% relative improvement in coverage for an SMS text corpus at vocabulary sizes of 5 K, 10 K, 15 K, and 20 K. In particular, the coverage of the 15 K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
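As a rough illustration of the quantities the abstract refers to, the sketch below builds a frequency-list baseline vocabulary and measures token coverage on a test corpus. It is not the authors' implementation: the function names, the toy corpora, and the extra words are invented for illustration. The abstract also does not define "relative improvement", so the sketch assumes one plausible reading, the relative reduction of the out-of-vocabulary (OOV) rate.

```python
from collections import Counter


def top_k_vocabulary(tokens, k):
    """Baseline: the k most frequent words of a corpus (a word frequency list)."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def coverage(vocabulary, tokens):
    """Fraction of running tokens in `tokens` that appear in `vocabulary`."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


def relative_oov_reduction(baseline_cov, proposed_cov):
    """ASSUMED definition: relative reduction of the OOV rate (1 - coverage)."""
    baseline_oov = 1.0 - baseline_cov
    if baseline_oov == 0.0:
        return 0.0
    return (proposed_cov - baseline_cov) / baseline_oov


# Toy data, invented for illustration only.
domain_corpus = "the meeting is at noon the report is due at noon".split()
test_corpus = "see you at the cafe at noon".split()

# Baseline vocabulary depends on the domain corpus.
baseline_vocab = top_k_vocabulary(domain_corpus, 4)

# Hypothetical stand-in for words added without consulting the domain corpus.
proposed_vocab = baseline_vocab | {"see", "you"}

baseline_cov = coverage(baseline_vocab, test_corpus)
proposed_cov = coverage(proposed_vocab, test_corpus)
improvement = relative_oov_reduction(baseline_cov, proposed_cov)
```

The baseline here depends entirely on `domain_corpus`, which is the dependence the abstract sets out to remove. Under a different definition of relative improvement, only `relative_oov_reduction` would need to change.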
Keywords :
search engines; speech recognition; vocabulary; Google search; domain corpus independent vocabulary generation algorithm; domain corpus independent word inclusion method; embedded continuous speech recognition; knowledge base system; named entity related part-of-speech; part-of-speech tagged corpus; verb-related part of speech; word frequency list; Computer science; Context modeling; Dictionaries; Frequency; Natural languages; Space exploration; Space technology; Speech recognition; Statistical analysis; Vocabulary; Coverage; Domain corpus independent; Embedded speech recognition; Vocabulary;
Journal_Title :
Consumer Electronics, IEEE Transactions on
DOI :
10.1109/TCE.2009.5278036