DocumentCode
2665322
Title
Automatic extraction of the unlisted terms in the field of information technology based on the dynamic circulation corpus
Author
Wang, Qiangjun ; Park, Isabella ; Zhang, Pu
Author_Institution
Coll. of Humanity, Hebei Univ., China
fYear
2003
fDate
26-29 Oct. 2003
Firstpage
452
Lastpage
458
Abstract
We discuss the automatic extraction of unlisted terms in the field of information technology based on the large-scale DCC (dynamic circulation corpus), under the theory of dynamic updating of language and knowledge. We propose the concept of a concatenation index to decide whether a character string is a word or phrase. We also present a new approach, named "concatenation index + TFIDF", for extracting unlisted terms from a large-scale corpus of a given field. The experiment uses IT (information technology) texts of around 17 million Chinese characters as the object corpus, and common-usage texts of around 600 million Chinese characters as the contrast corpus. As a result, a tentative workflow has been established, and the approach turned out to be efficient.
Keywords
character recognition; information technology; natural languages; text analysis; automatic term extraction; character string; concatenation index; dynamic circulation corpus; dynamic updation theory; information technology; Data mining; Dictionaries; Educational institutions; ISO; Information technology; Large-scale systems; Measurement units; Terminology; Tiles; Web sites;
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location
Beijing, China
Print_ISBN
0-7803-7902-0
Type
conf
DOI
10.1109/NLPKE.2003.1275949
Filename
1275949