DocumentCode :
3738571
Title :
THUEE language modeling method for the OpenKWS 2015 evaluation
Author :
Zhuo Zhang;Wei-Qiang Zhang;Kai-Xiang Shen;Xu-Kui Yang;Yao Tian;Meng Cai;Jia Liu
Author_Institution :
Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
fYear :
2015
Firstpage :
534
Lastpage :
538
Abstract :
In this paper, we describe the THUEE (Department of Electronic Engineering, Tsinghua University) team's method of building language models (LMs) for the OpenKWS 2015 Evaluation held by the National Institute of Standards and Technology (NIST). Because NIST provides only very limited in-domain data, most of our effort goes into making good use of the out-of-domain data. Our work consists of three main steps. First, the out-of-domain data is cleaned. Second, by comparing the cross-entropy difference between the in-domain and out-of-domain data, the part of the out-of-domain corpus that is well matched to the in-domain one is selected as training data. Third, the final n-gram LM is obtained by interpolating individual n-gram LMs trained on the different corpora, and all the training data is further combined to train a single feed-forward neural network LM (FNNLM). In this way, we reduce the perplexity on the development test data by 8.3% for the n-gram LM and 1.7% for the FNNLM, and the Actual Term-Weighted Value (ATWV) of the final result is 0.5391.
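A minimal sketch of the second step, assuming the standard cross-entropy difference selection criterion (Moore and Lewis, 2010), which the abstract's description matches. The toy unigram LM and all function names here are illustrative assumptions, not the authors' implementation (the paper's actual LMs are n-grams).

import math
from collections import Counter

def train_unigram(sentences):
    # Toy add-one-smoothed unigram LM; stands in for a real n-gram LM.
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one count for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    # Per-word cross-entropy (in bits) of a sentence under the LM.
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / max(len(words), 1)

def select_matched(out_domain, in_lm, out_lm, threshold=0.0):
    # Keep out-of-domain sentences that the in-domain LM scores better
    # than the out-of-domain LM: H_in(s) - H_out(s) below the threshold.
    return [s for s in out_domain
            if cross_entropy(in_lm, s) - cross_entropy(out_lm, s) < threshold]

In practice the two LMs would be proper n-gram models (e.g., trained with SRILM or KenLM), and the threshold would be tuned on development data.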
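The third step linearly interpolates the individual n-gram LMs. The paper does not say how the interpolation weights are chosen; the sketch below assumes the common approach of tuning them by expectation-maximization to maximize held-out likelihood. component_probs and heldout are hypothetical placeholders.

def em_interpolation_weights(component_probs, heldout, iters=20):
    # component_probs: list of functions, each mapping a prediction
    # event (a word given its history) to its probability under one LM.
    # heldout: list of such events from in-domain development data.
    k = len(component_probs)
    w = [1.0 / k] * k  # start from uniform mixture weights
    for _ in range(iters):
        resp = [0.0] * k
        for ev in heldout:
            # E-step: posterior responsibility of each component LM
            p = [w[i] * component_probs[i](ev) for i in range(k)]
            z = sum(p)
            resp = [r + pi / z for r, pi in zip(resp, p)]
        # M-step: new weights are the normalized responsibilities
        total = sum(resp)
        w = [r / total for r in resp]
    return w

The interpolated probability of an event is then the weighted sum of the component probabilities; each EM iteration is guaranteed not to decrease the held-out likelihood.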
Keywords :
"Training","Data models","Training data","NIST","Interpolation","Buildings","Cleaning"
Publisher :
ieee
Conference_Title :
2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT)
Type :
conf
DOI :
10.1109/ISSPIT.2015.7394394
Filename :
7394394