Title :
Web-Based Language Model Domain Adaptation for Real World Voice Retrieval
Author :
Mengzhe Chen ; Qingqing Zhang ; Zhichao Wang ; Jielin Pan ; Yonghong Yan
Author_Institution :
Key Lab. of Speech Acoust. & Content Understanding, Beijing, China
Abstract :
This paper presents our recent work on the development of a real-world voice retrieval system that automatically updates language models for a specific domain with the latest web data. The paper tackles two of the main difficulties in building such a system. First, users of voice retrieval systems enter newly created "hot words" as keywords, so improving the recognition performance of these hot words is important to the quality of the user experience. Second, in our applications the retrieval domain is given, which raises the problem of how to automatically select in-domain data from the web and update domain-specific language models. To address these issues, the system obtains the latest text training data by searching for web data related to the top-ranking hot words. Based on these data, a block-based language modeling method is proposed to automatically build and update domain-specific language models. In addition, in-domain high-frequency words and phrases are added to the lexicon to keep it up to date. Experimental results on a real-world users' voice retrieval dataset showed that updating the system yielded consistent improvements in in-domain voice retrieval recognition.
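The lexicon-updating step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `update_lexicon`, the `min_count` threshold, and whitespace tokenization are all assumptions made for illustration (a Chinese-language system would need a word segmenter instead of `split()`).

```python
from collections import Counter


def update_lexicon(lexicon, in_domain_sentences, min_count=2):
    """Add frequent in-domain words to a lexicon (hypothetical helper).

    lexicon: set of words already known to the recognizer.
    in_domain_sentences: iterable of text lines selected as in-domain.
    min_count: assumed frequency threshold for "high-frequency" words.
    Returns the updated lexicon and the sorted list of newly added words.
    """
    # Count word occurrences across all in-domain sentences.
    counts = Counter(
        word for sentence in in_domain_sentences for word in sentence.split()
    )
    # Keep only words that are frequent enough and not yet in the lexicon.
    new_words = {
        word
        for word, count in counts.items()
        if count >= min_count and word not in lexicon
    }
    return set(lexicon) | new_words, sorted(new_words)
```

For example, calling `update_lexicon({"play", "movie"}, ["play gravity movie", "gravity trailer", "play gravity"])` would add only `gravity`, because it is the one unseen word that reaches the assumed threshold.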
Keywords :
Internet; information retrieval; natural language processing; text analysis; Web data search; Web-based language model domain adaptation; automatic language model update; block-based language modeling method; domain-specific language models; hot word recognition; in-domain high-frequency phrases; in-domain high-frequency words; in-domain voice retrieval recognition; lexicon updating; recognition performance; text training data; voice retrieval dataset; voice retrieval system; Adaptation models; Data models; Entertainment industry; Hidden Markov models; Speech recognition; Training; Training data; block-based language model; domain-specific language model; voice retrieval
Conference_Title :
Computational Intelligence and Security (CIS), 2013 9th International Conference on
Conference_Location :
Leshan
Print_ISBN :
978-1-4799-2548-3
DOI :
10.1109/CIS.2013.28