Title :
A Weibo-Oriented Method for Unknown Word Extraction
Author :
Zhang, Shuai ; Liu, Qianren ; Wang, Lei
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
Unknown word recognition is one of the most prominent and challenging problems in Chinese language processing. Several effective approaches have been proposed; however, they do not work well on Chinese Twitter (i.e., Weibo) messages. In this paper, a method is presented to recognize unknown words in Weibo messages. Because Weibo messages exhibit great flexibility in wording and a high correlation between unknown words and unpredictable topics, the proposed method first groups the corpus into multiple categories using K-means and then derives a morpheme set from each category based on local term frequencies. Second, for each potential unknown word in every morpheme set, a newly introduced measure, named adjacency degree, is calculated to determine whether a correct unknown word has been found. Experiments show that the proposed method is efficient, precise, and insensitive to the size of the Weibo corpus.
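The second half of the pipeline described in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation: the abstract does not define the adjacency degree, so the sketch substitutes a simple count of distinct neighboring characters as a stand-in, and it assumes the messages have already been grouped into categories (for example by K-means). The function names, the `min_freq` and `min_variety` thresholds, and the n-gram length limit are all hypothetical.

```python
from collections import Counter


def candidate_counts(messages, max_len=4):
    """Count every character n-gram (2..max_len characters) within one category."""
    counts = Counter()
    for msg in messages:
        for n in range(2, max_len + 1):
            for i in range(len(msg) - n + 1):
                counts[msg[i:i + n]] += 1
    return counts


def neighbor_variety(messages, cand):
    """Assumed stand-in for the adjacency degree (not the paper's definition):
    the smaller of the numbers of distinct left and right neighboring characters."""
    left, right = set(), set()
    for msg in messages:
        start = msg.find(cand)
        while start != -1:
            end = start + len(cand)
            # "^" and "$" mark the message boundaries as pseudo-neighbors.
            left.add(msg[start - 1] if start > 0 else "^")
            right.add(msg[end] if end < len(msg) else "$")
            start = msg.find(cand, start + 1)
    return min(len(left), len(right))


def extract(clusters, min_freq=3, min_variety=2):
    """Keep, per category, the candidates that pass a local frequency threshold
    and whose surrounding context is varied enough."""
    found = {}
    for label, messages in clusters.items():
        counts = candidate_counts(messages)
        found[label] = sorted(
            cand for cand, freq in counts.items()
            if freq >= min_freq and neighbor_variety(messages, cand) >= min_variety
        )
    return found
```

Applying the frequency threshold per category rather than over the whole corpus mirrors the "local term frequencies" idea in the abstract: a string that is rare overall can still be frequent inside the topic where it emerged.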
Keywords :
Internet; natural language processing; Chinese language processing; Chinese twitter messages; Weibo Oriented Method; adjacency degree; morpheme set; unknown word extraction; Algorithm design and analysis; Clustering algorithms; Correlation; Data mining; Educational institutions; Statistical analysis; Vectors; Adjacency Degree; Improved K-means; Local Threshold; Unknown Word Extraction;
Conference_Title :
Semantics, Knowledge and Grids (SKG), 2012 Eighth International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2561-5
DOI :
10.1109/SKG.2012.15