Auto-Identifying Terms Based on a Place-Extending Method

Author

Zezhi Zheng

Author_Institution

Dept. of Chinese Language & Literature, Xiamen Univ., Xiamen, China

fYear

2011

fDate

17-18 July 2011

Firstpage

1

Lastpage

5

Abstract

The normalized relative frequency ratio is used as the domain differential degree to estimate how domain-specific a string is; the sequence correlation coefficient is used to judge the stability of a string. The identification process takes two steps. 1) Get term seeds. Extract adjacent character pairs from the domain corpus and the general corpus respectively, then obtain term seeds by filtering the adjacent pairs jointly with the domain differential degree, mutual information and the taboo character list. 2) Gain terms. Using a character-by-character extension strategy, take the term seeds as anchor points and extend each seed on both sides one character at a time. Filter every added character with the sequence correlation coefficient, the exception-correction rules and the taboo word list in turn. Take the terms with the character, as an example. The test showed that the precision and the recall rate of the algorithm reached 86.73% and 85.91%, respectively.
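
The seed step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract gives no formulas, so the smoothing and the r / (1 + r) normalization in `domain_degree`, the use of pointwise mutual information, the thresholds `min_degree` and `min_pmi`, and all function names are hypothetical. The extension step is left out because the abstract does not define the sequence correlation coefficient.

```python
import math
from collections import Counter


def pair_counts(text):
    """Count adjacent character pairs in a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def domain_degree(pair, dom, gen, dom_total, gen_total):
    """Assumed form of the normalized relative frequency ratio, in (0, 1).

    Add-one smoothing keeps the ratio finite for pairs that never
    occur in the general corpus. A value of 0.5 means the pair is
    equally frequent in both corpora.
    """
    p_dom = (dom[pair] + 1) / (dom_total + 1)
    p_gen = (gen[pair] + 1) / (gen_total + 1)
    ratio = p_dom / p_gen
    return ratio / (1 + ratio)


def pmi(pair, pairs, chars, n_pairs, n_chars):
    """Pointwise mutual information of the two characters in a pair."""
    p_xy = pairs[pair] / n_pairs
    p_x = chars[pair[0]] / n_chars
    p_y = chars[pair[1]] / n_chars
    return math.log2(p_xy / (p_x * p_y))


def term_seeds(domain_text, general_text, taboo_chars,
               min_degree=0.6, min_pmi=0.0):
    """Keep domain pairs that pass all three filters (thresholds are assumptions)."""
    dom = pair_counts(domain_text)
    gen = pair_counts(general_text)
    chars = Counter(domain_text)
    n_dom = sum(dom.values())
    n_gen = sum(gen.values())
    n_chars = len(domain_text)

    seeds = set()
    for pair in dom:
        # Taboo character list: drop pairs containing a listed character.
        if any(c in taboo_chars for c in pair):
            continue
        # Domain differential degree: pair must favour the domain corpus.
        if domain_degree(pair, dom, gen, n_dom, n_gen) < min_degree:
            continue
        # Mutual information: the two characters must be associated.
        if pmi(pair, dom, chars, n_dom, n_chars) < min_pmi:
            continue
        seeds.add(pair)
    return seeds
```

For instance, calling `term_seeds(domain_text, general_text, taboo_chars)` on two corpora loaded as plain strings returns the set of two-character seeds that would then serve as anchor points for the extension step.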

Keywords

character recognition; correlation methods; feature extraction; sequences; string matching; domain differential degree; place extending method; sequence correlation coefficient; string stability; taboo word list; Correlation; Data mining; Feature extraction; Mutual information; Physics; Time frequency analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Circuits, Communications and System (PACCS), 2011 Third Pacific-Asia Conference on

Conference_Location

Wuhan

Print_ISBN

978-1-4577-0855-8

Type

conf

DOI

10.1109/PACCS.2011.5990133

Filename

5990133