Title :
A hybrid method to segment words
Author :
Dai, Yubiao ; Ren, Xueli
Author_Institution :
Dept. of Comput. Sci. & Eng., QuJing Normal Univ., Qujing, China
Abstract :
Word segmentation is the foundation of machine translation, text classification and information searching. A hybrid method is proposed that combines dictionary-based word segmentation using reverse maximum matching with statistics-based word segmentation using suffix arrays. The input text is first segmented with the dictionary-based reverse maximum matching method. Two-way suffix arrays are then constructed, the longest common prefixes are computed, and candidate words are filtered by setting a threshold. The candidate words are further filtered using mutual information to obtain the true words, and ambiguous text is filtered using information entropy. Experiments show that the word segmentation accuracy can exceed 97%.
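As an illustration of the dictionary-based step named in the abstract, the following is a minimal sketch of reverse maximum matching. It is not the authors' implementation: the function name `reverse_maximum_match`, the `max_len` window size, the single-character fallback, and the toy dictionary are all assumptions made for this example, and the suffix-array, mutual-information and entropy stages are not shown.

```python
def reverse_maximum_match(text, dictionary, max_len=5):
    """Segment `text` from right to left, taking the longest dictionary
    word that ends at the current position (hypothetical helper)."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest window first and shrink it until a match is found.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # Assumed fallback: an unmatched single character becomes a token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Toy dictionary, invented for this example only.
toy_dictionary = {"no", "now", "here", "where"}
segments = reverse_maximum_match("nowhere", toy_dictionary)
print(segments)  # ['no', 'where']
```

Scanning from the right makes the segmenter choose "where" before it considers "now", so the same dictionary can yield a different split than a left-to-right scan would. Ambiguities of this kind are what the statistical stages described in the abstract are meant to resolve.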
Keywords :
language translation; natural language processing; pattern classification; text analysis; common prefix; hybrid method; information entropy; information searching; input texts; machine translation; suffix array; text classification; word segmentation; Accuracy; Arrays; Dictionaries; Information filters; Matched filters; Sorting;
Conference_Title :
Audio, Language and Image Processing (ICALIP), 2012 International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0173-2
DOI :
10.1109/ICALIP.2012.6376786