DocumentCode :
2543506
Title :
Extracting domain-specific terms from unlabeled web documents by bootstrapping and term classifiers
Author :
Liu, Tao ; Wang, Xiao-long ; Liu, Bing-quan ; Liu, Yuan-Chao ; Li, Ming-Hui
Author_Institution :
Harbin Inst. of Technol., Harbin
fYear :
2007
fDate :
7-10 Oct. 2007
Firstpage :
3875
Lastpage :
3880
Abstract :
Domain-specific term extraction contributes to all domain-oriented natural language processing tasks. Given a small set of domain-specific seed terms, new terms can be extracted from unlabeled corpora by bootstrapping a term classifier that discovers the association between seed terms and new terms. The traditional representation method for domain-specific term extraction represents a term in a feature space of documents, which captures the association between terms that share common documents. This representation cannot capture information about a term within a document, and it requires extracted terms to occur in multiple documents. This paper proposes a new term representation method in a global contextual space for domain-specific term extraction. The representation captures the association between terms that share common global contexts, which carry information about a term both within a single document and across the corpus. Experiments on a Chinese web corpus show that the proposed extraction method with global contextual representation outperforms the traditional method with document-space representation. The improvement is much larger for low-frequency terms.
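The contrast the abstract draws between the two representations can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define "global context" precisely, so the fixed token window used here, the English example sentences, and the names `document_vector`, `context_vector`, and `cosine` are all assumptions made for illustration. The bootstrapped classifier itself is not shown.

```python
from collections import Counter
from math import sqrt


def document_vector(term, docs):
    """Document-space representation: how often the term occurs in each document."""
    return Counter({i: doc.count(term) for i, doc in enumerate(docs) if term in doc})


def context_vector(term, docs, window=2):
    """Context-space representation (assumed form): words seen within
    `window` tokens of the term, pooled over every document in the corpus."""
    ctx = Counter()
    for doc in docs:
        for i, tok in enumerate(doc):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                # Count neighbours on both sides, skipping the term itself.
                ctx.update(w for j, w in enumerate(doc[lo:hi], start=lo) if j != i)
    return ctx


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical two-document corpus: each term occurs once, in a different document.
docs = [
    "the router forwards each packet to the next hop".split(),
    "the switch forwards each frame to the next port".split(),
]

doc_sim = cosine(document_vector("router", docs), document_vector("switch", docs))
ctx_sim = cosine(context_vector("router", docs), context_vector("switch", docs))
print(doc_sim, ctx_sim)
```

In this toy corpus the two terms never share a document, so their document-space similarity is zero, while their shared neighbouring words give a positive context-space similarity. This matches the limitation stated in the abstract: a document-space representation needs terms to occur in multiple, shared documents, which low-frequency terms often do not.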
Keywords :
Internet; information retrieval; natural language processing; pattern classification; text analysis; bootstrapping; contextual representation method; domain-oriented natural language processing task; domain-specific term classifier extraction; unlabeled Web documents; Costs; Data mining; Frequency; Large-scale systems; Machine learning; Natural language processing; Ontologies; Research and development; Search engines; Statistical analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
978-1-4244-0990-7
Electronic_ISBN :
978-1-4244-0991-4
Type :
conf
DOI :
10.1109/ICSMC.2007.4413834
Filename :
4413834