• DocumentCode
    39108
  • Title
    An improved unsupervised approach to word segmentation
  • Author
    Hanshi Wang ; Xuhong Han ; Lizhen Liu ; Wei Song ; Mudan Yuan
  • Author_Institution
    Inf. & Eng. Coll., Capital Normal Univ., Beijing, China
  • Volume
    12
  • Issue
    7
  • fYear
    2015
  • fDate
    July 2015
  • Firstpage
    82
  • Lastpage
    95
  • Abstract
    ESA is an unsupervised approach to word segmentation previously proposed by Wang; it is an iterative process consisting of three phases: Evaluation, Selection and Adjustment. In this article, we propose ExESA, an extension of ESA. In ExESA, the original approach is extended to a 2-pass process, and the ratio of different word lengths is introduced as a third type of information, combined with cohesion and separation. A maximum strategy is adopted to determine the best segmentation of a character sequence in the Selection phase. In addition, in the Adjustment phase, ExESA re-evaluates separation information and individual information to overcome the overestimation of frequencies. A smoothing algorithm is also applied to alleviate sparseness. The experimental results show that ExESA further improves performance and saves time by properly utilizing more information from un-annotated corpora. Moreover, the parameters of ExESA can be predicted by a set of empirical formulae or combined with the minimum description length principle.
  • Keywords
    iterative methods; natural language processing; smoothing methods; text analysis; unsupervised learning; ExESA; character sequence segmentation; cohesion; improved unsupervised approach; iterative process; minimum description length principle; overestimation frequency; separation information; smoothing algorithm; word length; word segmentation; Accuracy; Entropy; Frequency measurement; Length measurement; Prediction algorithms; Uncertainty; character sequence; maximum strategy
  • fLanguage
    English
  • Journal_Title
    China Communications
  • Publisher
    IEEE
  • ISSN
    1673-5447
  • Type
    jour
  • DOI
    10.1109/CC.2015.7188527
  • Filename
    7188527