Title :
Automatic extraction of the unlisted terms in the field of information technology based on the dynamic circulation corpus
Author :
Wang, Qiangjun ; Park, Isabella ; Zhang, Pu
Author_Institution :
Coll. of Humanity, Hebei Univ., China
Abstract :
This paper discusses the automatic extraction of unlisted terms in the field of information technology (IT), based on the large-scale dynamic circulation corpus (DCC) and under the theory of dynamic updating of language and knowledge. It proposes the concept of a concatenation index to decide whether a character string is a word or phrase. It also presents a new approach, named "concatenation index + TFIDF", for extracting unlisted terms from a large-scale corpus of a specific field. The experiment uses about 17 million Chinese characters of IT text as the object corpus and about 600 million Chinese characters of general-usage text as the contrast corpus. As a result, a tentative workflow has been established, and the approach proved to be efficient.
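The abstract does not define the concatenation index or give the exact TFIDF weighting, so the following is only a minimal sketch of how a "cohesion filter + TFIDF" pipeline over an object corpus and a contrast corpus might look. The function names, the thresholds, the smoothed IDF formula, and above all the cohesion score (minimum pointwise mutual information over all binary splits of a string) are assumptions chosen for illustration, not the authors' method.

```python
import math
from collections import Counter


def ngram_counts(docs, max_n=4):
    """Count every character n-gram of length 1..max_n, document by document."""
    counts = Counter()
    for doc in docs:
        for n in range(1, max_n + 1):
            for i in range(len(doc) - n + 1):
                counts[doc[i:i + n]] += 1
    return counts


def cohesion(candidate, counts, total_chars):
    """Hypothetical stand-in for the concatenation index (an assumption):
    the minimum pointwise mutual information over all two-way splits."""
    if counts[candidate] == 0:
        return float("-inf")
    p_whole = counts[candidate] / total_chars
    weakest = float("inf")
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        weakest = min(weakest, math.log(p_whole / (p_left * p_right)))
    return weakest


def contrast_idf(candidate, contrast_docs):
    """Smoothed inverse document frequency measured on the contrast corpus."""
    df = sum(1 for doc in contrast_docs if candidate in doc)
    return math.log((1 + len(contrast_docs)) / (1 + df))


def extract_terms(object_docs, contrast_docs, max_n=4,
                  min_cohesion=0.0, min_count=2):
    """Keep cohesive strings from the object corpus, ranked by tf * idf."""
    counts = ngram_counts(object_docs, max_n)
    total_chars = sum(len(doc) for doc in object_docs)
    scored = []
    for cand, tf in counts.items():
        if len(cand) < 2 or tf < min_count:
            continue  # single characters and rare strings are skipped
        if cohesion(cand, counts, total_chars) < min_cohesion:
            continue  # parts do not stick together strongly enough
        scored.append((tf * contrast_idf(cand, contrast_docs), cand))
    return [cand for _, cand in sorted(scored, reverse=True)]
```

In this sketch, a string survives only if it is frequent and cohesive in the object corpus, and it ranks higher the less often it appears in the contrast corpus. A real system would additionally remove strings already listed in a dictionary, since the paper targets unlisted terms.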
Keywords :
character recognition; information technology; natural languages; text analysis; automatic term extraction; character string; concatenation index; dynamic circulation corpus; dynamic updating theory; Data mining; Dictionaries; Educational institutions; ISO; Large-scale systems; Measurement units; Terminology; Tiles; Web sites
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location :
Beijing, China
Print_ISBN :
0-7803-7902-0
DOI :
10.1109/NLPKE.2003.1275949