• DocumentCode
    2665322
  • Title
    Automatic extraction of the unlisted terms in the field of information technology based on the dynamic circulation corpus
  • Author
    Wang, Qiangjun ; Park, Isabella ; Zhang, Pu
  • Author_Institution
    Coll. of Humanity, Hebei Univ., China
  • fYear
    2003
  • fDate
    26-29 Oct. 2003
  • Firstpage
    452
  • Lastpage
    458
  • Abstract
    This paper discusses automatic extraction of unlisted terms in the field of information technology, based on the large-scale DCC (dynamic circulation corpus) and on the theory of dynamic updating of language and knowledge. It proposes the concept of a concatenation index to decide whether a character string is a word or phrase. It also presents a new approach, named "concatenation index + TFIDF", for extracting unlisted terms from a large-scale corpus of a given field. The experiment uses about 17 million Chinese characters of text from the IT (information technology) field as the object corpus, and about 600 million Chinese characters of general-usage text as the contrast corpus. As a result, a tentative workflow has been established, and the approach proved efficient.
  • Keywords
    character recognition; information technology; natural languages; text analysis; automatic term extraction; character string; concatenation index; dynamic circulation corpus; dynamic updating theory; Data mining; Dictionaries; Educational institutions; ISO; Large-scale systems; Measurement units; Terminology; Tiles; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
  • Conference_Location
    Beijing, China
  • Print_ISBN
    0-7803-7902-0
  • Type
    conf
  • DOI
    10.1109/NLPKE.2003.1275949
  • Filename
    1275949
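
The "concatenation index + TFIDF" approach named in the abstract can be illustrated with a rough sketch. This record does not define the concatenation index, so the `cohesion` function below is a stand-in assumption (a pointwise-mutual-information-style ratio), not the authors' formula. The function names, the fixed n-gram length, the `min_cohesion` threshold, and the smoothed IDF are likewise hypothetical choices made only for illustration: candidates are first filtered by cohesion in the object (domain) corpus, then ranked by their frequency weighted against the contrast corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cohesion(candidate, ngram_count, char_counts, total_chars):
    """PMI-style stand-in for the concatenation index (an assumption).

    Compares how often the string occurs with how often it would occur
    if its characters were independent of one another.
    """
    p_joint = ngram_count / total_chars
    p_independent = math.prod(char_counts[c] / total_chars for c in candidate)
    return math.log(p_joint / p_independent)


def extract_terms(domain_docs, contrast_docs, n=2, min_cohesion=0.0):
    """Rank character n-grams of the object corpus as candidate terms."""
    char_counts = Counter(c for doc in domain_docs for c in doc)
    total_chars = sum(char_counts.values())
    # Term frequency: occurrences of each n-gram in the object corpus.
    tf = Counter(g for doc in domain_docs for g in char_ngrams(doc, n))

    scored = {}
    for cand, count in tf.items():
        # Step 1: drop strings whose characters do not cohere strongly.
        if cohesion(cand, count, char_counts, total_chars) <= min_cohesion:
            continue
        # Step 2: down-weight strings that are common in general usage.
        df = sum(1 for doc in contrast_docs if cand in doc)
        idf = math.log((1 + len(contrast_docs)) / (1 + df)) + 1.0
        scored[cand] = count * idf
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy data only; the corpora described in the abstract are far larger.
    for term, score in extract_terms(["abcd", "abcd"], ["ab", "ab", "bc"]):
        print(term, round(score, 3))
```

Computing the document frequency on the contrast corpus rather than on the object corpus is one way to favour strings that are frequent in the field but rare in general usage; whether the paper combines the two measures in this way is not stated in this record.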