Title :
Chinese Term Recognition and Extraction Based on Hidden Markov Model
Author :
Cen, Yonghua ; Han, Zhe ; Ji, PeiPei
Author_Institution :
Dept. of Inf. Manage., Nanjing Univ., Nanjing
Abstract :
Motivated by the probabilistic characteristics of syntactic composition in Chinese text, especially part-of-speech (POS) patterns, and by the internal structure of most unlexicalized Chinese domain terms, this paper proposes and implements a system framework for recognizing and extracting domain-specific Chinese terms based on a hidden Markov model (HMM). The system learns the HMM parameters from a training corpus in which words have been roughly segmented and POS-tagged by the ICTCLAS system, developed by the Chinese Academy of Sciences, and term boundaries have been labeled manually. Using the HMM with the learned parameters, the system labels term boundaries in Chinese text from different domains and recognizes terms according to those boundaries. The system shows good performance, and the recognized terms can serve as candidate terms for subsequent filtering of false positives and further optimization in combination with other measures such as mutual information and domain dependency.
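The pipeline described in the abstract (learn HMM parameters from a boundary-labeled corpus, label boundaries in new text, read terms off the boundaries) can be illustrated with a minimal sketch. The abstract does not give the paper's state set, smoothing, or decoder, so everything below is an assumption made for illustration: hidden states are B/I/O boundary labels (begin, inside, outside a term), observations are POS tags, parameters are add-one smoothed counts, and decoding uses the standard Viterbi algorithm. The function names, the toy corpus, and the tags `n`, `v`, `u`, `d` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumed hidden states: B = term begins, I = inside a term, O = outside.
STATES = ["B", "I", "O"]


def train(corpus, alpha=1.0):
    """Estimate add-alpha smoothed log-probabilities.

    corpus: list of sentences, each a list of (pos_tag, boundary_label).
    Returns (start, trans, emit): two dicts and an emission function.
    """
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    tags = set()
    for sentence in corpus:
        prev = None
        for pos, label in sentence:
            tags.add(pos)
            emit[label][pos] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label

    def logp(counter, key, n_outcomes):
        total = sum(counter.values())
        return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))

    n_tags = len(tags) + 1  # one extra slot for unseen POS tags
    start_lp = {s: logp(start, s, len(STATES)) for s in STATES}
    trans_lp = {a: {b: logp(trans[a], b, len(STATES)) for b in STATES}
                for a in STATES}
    return start_lp, trans_lp, lambda s, pos: logp(emit[s], pos, n_tags)


def viterbi(pos_tags, model):
    """Return the most probable boundary-label sequence for pos_tags."""
    if not pos_tags:
        return []
    start, trans, emit = model
    score = {s: start[s] + emit(s, pos_tags[0]) for s in STATES}
    backpointers = []
    for pos in pos_tags[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + trans[p][s])
            new_score[s] = score[best] + trans[best][s] + emit(s, pos)
            ptr[s] = best
        score = new_score
        backpointers.append(ptr)
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_terms(words, labels):
    """Join words covered by each B (I ...) run into one candidate term."""
    terms, current = [], []
    for word, label in zip(words, labels):
        if label == "I" and current:
            current.append(word)
            continue
        if current:
            terms.append("".join(current))
        current = [word] if label == "B" else []
    if current:
        terms.append("".join(current))
    return terms


# Toy, hand-labeled training data (hypothetical; POS tag + boundary label).
toy_corpus = [
    [("n", "B"), ("n", "I"), ("v", "O"), ("n", "B"), ("n", "I")],
    [("v", "O"), ("n", "B"), ("n", "I"), ("u", "O"), ("n", "B"), ("n", "I")],
    [("d", "O"), ("v", "O"), ("n", "B"), ("n", "I")],
]
model = train(toy_corpus)

words = ["提取", "领域", "术语", "的"]
pos_tags = ["v", "n", "n", "u"]
labels = viterbi(pos_tags, model)
print(labels, extract_terms(words, labels))
```

In this sketch the words themselves never enter the model; only their POS tags do, which mirrors the abstract's emphasis on POS patterns. The extracted strings are candidates only, to be filtered afterwards by measures such as mutual information or domain dependency.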
Keywords :
computational linguistics; hidden Markov models; knowledge based systems; natural language processing; probability; text analysis; Chinese Academy of Sciences; Chinese term recognition; Chinese textual information; ICTCLAS system; POS tagged; domain-specific Chinese terms; hidden Markov model; learned parameters knowledge; part of speech matching; probabilistic characteristics; syntax compositions; training corpus; unlexicalized Chinese domain terms; Character recognition; Data mining; Hidden Markov models; Information analysis; Information management; Information processing; Labeling; Mutual information; Speech analysis; Speech recognition; Chinese Term Recognition; HMM; Hidden Markov Model;
Conference_Title :
Computational Intelligence and Industrial Application, 2008. PACIIA '08. Pacific-Asia Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-0-7695-3490-9
DOI :
10.1109/PACIIA.2008.242