Regional Information Center for Science and Technology - Chinese Unknown Words Extraction Based on Word-Level Characteristics

DocumentCode :

501728

Title :

Chinese Unknown Words Extraction Based on Word-Level Characteristics

Author :

Pang, Wenbo ; Fan, Xiaozhong ; Gu, Yijun ; Yu, Jiangde

Author_Institution :

Sch. of Comput. & Technol., Beijing Inst. of Technol., Beijing, China

Volume :

fYear :

2009

fDate :

12-14 Aug. 2009

Firstpage :

361

Lastpage :

366

Abstract :

The automatic recognition of unknown words is an important problem in Chinese information processing. Based on the characteristics of words, this paper proposes a method to recognize new words using high-frequency strings. First, the high-frequency strings in each single document are extracted as candidate strings. Then the strings that do not satisfy the characteristics of a word's distribution and a word's independent usage are removed. Finally, the entire corpus is segmented with these candidate strings, and the word frequencies are counted for further filtering. Experimental results show that, on documents about basketball downloaded from the Zaobao newspaper, this method achieves an F-score of 79.39%.
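
The first and last steps of the abstract (per-document candidate extraction and corpus-level frequency counting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the length bounds and the frequency threshold are all assumptions chosen for illustration, plain substring counting stands in for real corpus segmentation, and the distribution and independent-usage filters are omitted because the abstract does not define them.

```python
from collections import Counter


def candidate_strings(doc, min_len=2, max_len=4, min_freq=3):
    """Return substrings of one document that occur at least min_freq times.

    min_len, max_len and min_freq are illustrative assumptions,
    not values taken from the paper.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n over the document (overlaps allowed).
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}


def corpus_frequency(docs, candidates):
    """Count non-overlapping occurrences of each candidate across all documents.

    A simple stand-in for segmenting the corpus with the candidate strings.
    """
    totals = Counter()
    for doc in docs:
        for s in candidates:
            totals[s] += doc.count(s)
    return totals
```

A caller would run `candidate_strings` on each document, pool the surviving strings, and then pass them to `corpus_frequency` so that a second, corpus-level threshold can be applied.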

Keywords :

document handling; information filtering; natural language processing; pattern recognition; Chinese information processing; Chinese unknown word extraction; automatic recognition; candidate string; corpus segmentation; document extraction; word distribution; word-level characteristics; Character recognition; Data mining; Dictionaries; Educational institutions; Entropy; Frequency; Hybrid intelligent systems; Information processing; Information security; Mutual information; Chinese unknown word; independent usage

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Hybrid Intelligent Systems, 2009. HIS '09. Ninth International Conference on

Conference_Location :

Shenyang

Print_ISBN :

978-0-7695-3745-0

Type :

conf

DOI :

10.1109/HIS.2009.77

Filename :

5254333

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=501728