Title :
Extracting Chinese multi-word terms from small corpus
Author :
Lang, Zhou ; Liang, Zhang ; Chong, Feng ; Heyan, Huang
Author_Institution :
Coll. of Comput. Sci. & Technol., Nanjing Univ. of Sci. & Technol., Nanjing, China
Abstract :
We present an automatic terminology extraction approach for Chinese multi-word terms. In addition to five linguistic rules learned from an existing term list with machine learning methods, the system uses two statistical measures: a termhood measure based on variation in term distribution, and a unithood measure that uses left and right entropy to estimate the degree of collocation variation. Candidates are ranked by their termhood values. The unithood measure is used to filter out prepositional phrases and some verb-object phrases that rarely appear as terms. Evaluated on a small-scale corpus in the computer domain, the approach reaches a precision of 91.5% on the top 2000 outputs.
Keywords :
entropy; information retrieval; learning (artificial intelligence); natural language processing; text analysis; Chinese multiword terms; automatic terminology extraction; collocation variation degree; left entropy method; linguistic rules; machine learning; right entropy method; small corpus; statistical strategies; term distribution variation; term extraction system; term list; termhood measure; unithood measure; verb-object phrases; Computer science; Data mining; Filters; Intelligent systems; Knowledge engineering; Measurement units; Natural languages; Statistical analysis; Terminology; Testing;
Conference_Title :
Intelligent System and Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference on
Conference_Location :
Xiamen
Print_ISBN :
978-1-4244-2196-1
Electronic_ISBN :
978-1-4244-2197-8
DOI :
10.1109/ISKE.2008.4731041
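
Illustration :
The left and right entropy idea named in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: the record does not give their exact formula, thresholds, or preprocessing. The function names, the use of base-2 logarithms, the sentence boundary markers "<s>" and "</s>", and the assumption that the corpus is already segmented into tokens are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def branching_entropy(neighbours):
    """Shannon entropy (base 2, an assumption) of a list of neighbouring tokens."""
    counts = Counter(neighbours)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(sentences, candidate):
    """Return (left entropy, right entropy) of a candidate token sequence.

    sentences: list of token lists (pre-segmented text, an assumption).
    candidate: tuple of tokens forming the multi-word candidate.
    """
    n = len(candidate)
    left, right = [], []
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == candidate:
                # "<s>" and "</s>" are hypothetical boundary markers.
                left.append(tokens[i - 1] if i > 0 else "<s>")
                right.append(tokens[i + n] if i + n < len(tokens) else "</s>")
    return branching_entropy(left), branching_entropy(right)
```

Under this sketch, a candidate whose neighbours vary widely on both sides gets high entropy on both sides, which suggests it behaves as an independent unit. A candidate with low entropy on one side tends to be a fragment of a longer expression. How the two values are combined and thresholded for filtering is left open here, since the record does not specify it.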