Automatic extraction of the unlisted terms in the field of information technology based on the dynamic circulation corpus

Author

Wang, Qiangjun ; Park, Isabella ; Zhang, Pu

Author_Institution

Coll. of Humanity, Hebei Univ., China

fYear

2003

fDate

26-29 Oct. 2003

Firstpage

452

Lastpage

458

Abstract

This paper discusses the automatic extraction of unlisted terms in the field of information technology from the large-scale DCC (dynamic circulation corpus), under the theory of dynamic updating of language and knowledge. It proposes the concept of a concatenation index to decide whether a character string is a word or phrase. It also presents a new approach, named "concatenation index + TFIDF", for extracting unlisted terms from a large-scale corpus of a given field. The experiment uses IT (information technology) texts of around 17 million Chinese characters as the object corpus, and common-usage texts of around 600 million Chinese characters as the contrast corpus. As a result, a tentative workflow has been established, and the approach proved to be efficient.
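The abstract does not define the concatenation index, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a PMI-style cohesion score as a stand-in for the concatenation index (to decide whether a character string holds together as a unit) and a smoothed contrast weight against a common-usage corpus as a stand-in for the TFIDF step. All names (`ngram_counts`, `cohesion`, `rank_terms`) and parameters (`n`, `min_count`, `min_cohesion`) are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cohesion(candidate, cand_count, total_ngrams, char_counts, total_chars):
    """PMI-style cohesion: log of observed n-gram probability over the
    probability expected if its characters occurred independently.
    ASSUMPTION: a stand-in for the paper's concatenation index."""
    p_observed = cand_count / total_ngrams
    p_independent = 1.0
    for ch in candidate:
        p_independent *= char_counts[ch] / total_chars
    return math.log(p_observed / p_independent)


def rank_terms(object_text, contrast_text, n=2, min_count=2, min_cohesion=0.0):
    """Return (candidate, score) pairs sorted by descending score.

    Step 1: keep n-grams from the object (domain) corpus that are frequent
            enough and whose cohesion exceeds min_cohesion.
    Step 2: weight each survivor by its relative frequency in the object
            corpus times a smoothed log-ratio against the contrast corpus.
    ASSUMPTION: step 2 is a TF-IDF-like contrast, not the paper's formula.
    """
    obj = ngram_counts(object_text, n)
    con = ngram_counts(contrast_text, n)
    char_counts = Counter(object_text)
    total_chars = len(object_text)
    total_obj = sum(obj.values())
    total_con = sum(con.values())

    scores = {}
    for cand, count in obj.items():
        if count < min_count:
            continue
        coh = cohesion(cand, count, total_obj, char_counts, total_chars)
        if coh <= min_cohesion:
            continue
        tf = count / total_obj
        # Candidates rare or absent in the contrast corpus get a higher weight.
        idf = math.log((1 + total_con) / (1 + con[cand]))
        scores[cand] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy strings only; the paper's corpora are far larger and in Chinese.
    for term, score in rank_terms("qzqzqzqzthth", "ththththabab"):
        print(term, round(score, 3))
```

Filtering by cohesion first and ranking by the contrast weight second mirrors the two-part name of the approach; in practice the thresholds would need tuning on real corpora.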

Keywords

character recognition; information technology; natural languages; text analysis; automatic term extraction; character string; concatenation index; dynamic circulation corpus; dynamic updation theory; Data mining; Dictionaries; Educational institutions; ISO; Large-scale systems; Measurement units; Terminology; Tiles; Web sites

fLanguage

English

Publisher

IEEE

Conference_Titel

Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on

Conference_Location

Beijing, China

Print_ISBN

0-7803-7902-0

Type

conf

DOI

10.1109/NLPKE.2003.1275949

Filename

1275949