DocumentCode :
3245796
Title :
Thematic text clustering for domain specific language model adaptation
Author :
Valsan, Zica ; Emele, Martin
Author_Institution :
Tangible User Interface Group, Sony Corporate Labs. Eur., Stuttgart, Germany
fYear :
2003
fDate :
30 Nov.-3 Dec. 2003
Firstpage :
513
Lastpage :
518
Abstract :
We propose a new approach to thematic text clustering. The text clusters are used to generate domain-specific language models (LMs) in order to address the problem of language model adaptation. The method relies on a new discriminative n-gram-based term selection process (n > 1), which reduces the influence of corpus inhomogeneity and outputs only semantically focused n-grams as the most representative key terms in the corpus. These key terms are then used to automatically cluster the whole document collection and to generate LMs from the resulting text clusters. Different key term selection methods are evaluated using perplexity as the measure. Automatically computed clusters are compared with manually assigned labels based on genre information. The results of these experimental studies are presented and discussed. Compared to manual clustering, a significant performance improvement of between 21.87% and 53.12% is observed, depending on the chosen key term selection method.
Keywords :
pattern clustering; speech recognition; text analysis; automatic clustering; discriminative n-gram process; document collection; domain specific language models; language model adaptation; perplexity measure; term selection process; thematic text clustering; Adaptation model; Domain specific languages; Europe; Information retrieval; Labeling; Laboratories; Natural languages; Speech recognition; Strontium; User interfaces;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on
Print_ISBN :
0-7803-7980-2
Type :
conf
DOI :
10.1109/ASRU.2003.1318493
Filename :
1318493