Title :
An improved unsupervised approach to word segmentation
Author :
Hanshi Wang ; Xuhong Han ; Lizhen Liu ; Wei Song ; Mudan Yuan
Author_Institution :
Inf. & Eng. Coll, Capital Normal Univ., Beijing, China
Abstract :
ESA is an unsupervised approach to word segmentation previously proposed by Wang; it is an iterative process consisting of three phases: Evaluation, Selection and Adjustment. In this article, we propose ExESA, an extension of ESA. In ExESA, the original approach is extended to a 2-pass process, and the ratio of different word lengths is introduced as a third type of information, combined with cohesion and separation. A maximum strategy is adopted to determine the best segmentation of a character sequence in the Selection phase. In addition, in the Adjustment phase, ExESA re-evaluates separation information and individual information to overcome the overestimation of frequencies. A smoothing algorithm is also applied to alleviate sparseness. The experimental results show that ExESA further improves performance and saves time by properly utilizing more information from un-annotated corpora. Moreover, the parameters of ExESA can be predicted by a set of empirical formulae or in combination with the minimum description length principle.
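As an illustration only, the idea of choosing the best segmentation of a character sequence by a maximum strategy can be sketched with a small dynamic program. This is not the ExESA scoring: the function `best_segmentation`, the `max_len` limit, the additive scoring, and the toy scores below are all assumptions made for the example, standing in for whatever combination of cohesion, separation and word-length ratio the paper actually uses.

```python
from typing import Callable, List, Tuple


def best_segmentation(
    chars: str,
    score: Callable[[str], float],
    max_len: int = 4,
) -> Tuple[float, List[str]]:
    """Return the highest-scoring split of `chars` into words of length <= max_len.

    The total score of a segmentation is assumed to be the sum of its word
    scores (an assumption of this sketch, not a claim about ExESA).
    """
    # best[i] holds (total score, words) for the prefix chars[:i].
    best: List[Tuple[float, List[str]]] = [(0.0, [])]
    for end in range(1, len(chars) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            prev_score, prev_words = best[start]
            word = chars[start:end]
            candidates.append((prev_score + score(word), prev_words + [word]))
        # Maximum strategy: keep only the best-scoring candidate for this prefix.
        best.append(max(candidates, key=lambda c: c[0]))
    return best[len(chars)]


# Hypothetical toy scores; real scores would be estimated from an un-annotated corpus.
TOY_SCORES = {"ab": 2.0, "cd": 2.0, "abc": 2.5}


def toy_score(word: str) -> float:
    # Candidates without a score get a penalty proportional to their length.
    return TOY_SCORES.get(word, -1.0 * len(word))


if __name__ == "__main__":
    print(best_segmentation("abcd", toy_score))
```

With these toy scores, splitting "abcd" into "ab" and "cd" scores higher than "abc" followed by "d", so the sketch selects the former even though "abc" has the highest single-word score.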
Keywords :
iterative methods; natural language processing; smoothing methods; text analysis; unsupervised learning; ExESA; character sequence segmentation; cohesion; improved unsupervised approach; iterative process; minimum description length principle; overestimation frequency; separation information; smoothing algorithm; word length; word segmentation; Accuracy; Entropy; Frequency measurement; Length measurement; Prediction algorithms; Uncertainty; character sequence; maximum strategy
Journal_Title :
China Communications
DOI :
10.1109/CC.2015.7188527