• DocumentCode
    2254619
  • Title
    Application of Conditional Random Fields model in unknown words identification
  • Author
    Zhang, Hai-Jun ; Pan, Wei-min ; Shi, Shu-min ; Zhu, Chao-yong
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Xinjiang Normal Univ., Urumqi, China
  • Volume
    4
  • fYear
    2010
  • fDate
    11-14 July 2010
  • Firstpage
    1839
  • Lastpage
    1843
  • Abstract
    This paper proposes a method for Unknown Words Identification (UWI) based on repeats. To ground the identification of unknown words in theory, we put forward a formal model of the UWI process, which guides the selection of features used in UWI. To solve the formal model, we employ the Conditional Random Fields (CRF) model as the statistical framework. Under this framework, UWI becomes the task of finding effective features that capture the essential properties of unknown words. The experiments show that the method is effective and that a reasonable combination of features in the CRF clearly improves the UWI result. The final result (F score) of this method is 47.81% in the open test and 69.83% in word extraction, which is better than the best result reported in previous work.
  • Keywords
    natural language processing; statistical analysis; conditional random field model; feature selection; statistical frame model; unknown word identification; Cybernetics; Data mining; Entropy; Feature extraction; Helium; Machine learning; Training; CRF; Chinese word segmentation; Feature combination; Repeats; Unknown words identification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
  • Conference_Location
    Qingdao
  • Print_ISBN
    978-1-4244-6526-2
  • Type
    conf
  • DOI
    10.1109/ICMLC.2010.5580955
  • Filename
    5580955