Title :
Extracting domain-specific terms from unlabeled web documents by bootstrapping and term classifiers
Author :
Liu, Tao ; Wang, Xiao-long ; Liu, Bing-quan ; Liu, Yuan-Chao ; Li, Ming-Hui
Author_Institution :
Harbin Inst. of Technol., Harbin
Abstract :
Domain-specific term extraction contributes to all domain-oriented natural language processing tasks. Given a small set of domain-specific terms as seed terms, new terms can be extracted from unlabeled corpora by bootstrapping a term classifier that discovers the association between seed terms and new terms. The traditional term representation method for domain-specific term extraction represents a term in a feature space of documents, which captures the association between terms that share common documents. This representation cannot capture the inner-document information of terms and requires extracted terms to occur in multiple documents. This paper proposes a new term representation method in a global contextual space for domain-specific term extraction. This representation captures the association between terms that share common global contexts. Global contexts capture information about terms both within a single document and across the corpus. Experiments on a Chinese web corpus show that the proposed domain-specific term extraction method with global contextual representation outperforms the traditional method with representation in document space. The improvement is considerably larger for low-frequency terms.
Keywords :
Internet; information retrieval; natural language processing; pattern classification; text analysis; bootstrapping; contextual representation method; domain-oriented natural language processing task; domain-specific term classifier extraction; unlabeled Web documents; Costs; Data mining; Frequency; Large-scale systems; Machine learning; Natural language processing; Ontologies; Research and development; Search engines; Statistical analysis
Conference_Title :
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
978-1-4244-0990-7
Electronic_ISBN :
978-1-4244-0991-4
DOI :
10.1109/ICSMC.2007.4413834