DocumentCode :
2666126
Title :
Unsupervised incremental acquisition of a thematic corpus from the Web
Author :
Duclaye, Florence ; Yvon, F. ; Collin, Olivier
Author_Institution :
France Telecom R&D, Lannion, France
fYear :
2003
fDate :
26-29 Oct. 2003
Firstpage :
752
Lastpage :
757
Abstract :
We present a nearly unsupervised learning methodology for automatically acquiring a thematic corpus from the Web. Relying on a bootstrapping mechanism, our system starts with a single linguistic expression of a given target semantic relationship. It then samples the Web so as to progressively accumulate a corpus of potential examples of the same relationship. Sampling steps alternate with filtering steps, making it possible to keep the corpus thematically focused. The corpus is finally analysed to search for potential paraphrases of the initial expression of the semantic relationship. These paraphrases will eventually be used to improve our question-answering system. We focus on the learning aspect of the system and report experimental results regarding the effectiveness of our filtering strategy.
Keywords :
Internet; linguistics; unsupervised learning; Web; automatic classification; bootstrap mechanism; linguistic expression; machine learning; machine-aided translation; paraphrase acquisition; question-answering system; thematic corpus; Automata; Conferences; Filtering; Inference algorithms; Research and development; Telecommunications; Thesauri; Training data
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location :
Beijing, China
Print_ISBN :
0-7803-7902-0
Type :
conf
DOI :
10.1109/NLPKE.2003.1276006
Filename :
1276006