Regional Information Center for Science and Technology - Using multiple features and statistical model to calculate text units similarity

DocumentCode :

2335013

Title :

Using multiple features and statistical model to calculate text units similarity

Author :

Xu, Yong-Dong ; Xu, Zhi-Ming ; Wang, Xiao-long ; Liu, Yuan-Chao ; Liu, Tao

Author_Institution :

Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., China

Volume :

fYear :

2005

fDate :

18-21 Aug. 2005

Firstpage :

3834

Abstract :

Identifying similar information in a set of related documents is a common problem in many NLP applications. In this paper, the similarity between two Chinese text units is determined from multiple features extracted from those units: word statistical features, part-of-speech features, semantic features, a word density feature, and text discourse structure features. In addition, a statistical method based on a logistic regression model is proposed to fuse these features automatically and calculate the similarity between text paragraphs. An experiment comparing this method with two widely used methods shows the effectiveness of the approach.
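
The feature-fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two features below (`word_overlap`, `length_ratio`) are hypothetical stand-ins for the paper's feature families, the toy training pairs are invented purely for illustration, and the plain stochastic-gradient training loop is an assumption. The sketch only shows how a logistic regression model could map a vector of pairwise features to a similarity score between 0 and 1.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def word_overlap(a, b):
    """Jaccard overlap of token sets (hypothetical word statistical feature)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def length_ratio(a, b):
    """Shorter length divided by longer length (hypothetical feature)."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def extract_features(a, b):
    """Build the feature vector for a pair of tokenized text units."""
    return [word_overlap(a, b), length_ratio(a, b)]


def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights with stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def similarity(a, b, w, bias):
    """Fuse the features into one similarity score in (0, 1)."""
    x = extract_features(a, b)
    return sigmoid(bias + sum(wj * xj for wj, xj in zip(w, x)))


# Toy labeled pairs (invented for illustration): 1 = similar, 0 = not similar.
pairs = [
    (["a", "b", "c"], ["a", "b", "c"], 1),
    (["a", "b", "c"], ["a", "b", "d"], 1),
    (["a", "b", "c"], ["x", "y", "z"], 0),
    (["a", "b"], ["x", "y", "z", "w"], 0),
]
X = [extract_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
weights, bias = train(X, y)

print(similarity(["a", "b", "c"], ["a", "b", "c"], weights, bias))
print(similarity(["a", "b", "c"], ["x", "y", "z"], weights, bias))
```

In this sketch the learned weights play the role of the automatic fusion step: each feature's contribution to the final score is estimated from labeled pairs rather than set by hand. Additional features would simply extend the vector returned by `extract_features`.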

Keywords :

feature extraction; natural languages; regression analysis; text analysis; word processing; Chinese text unit; logistic regression model; natural language processing; statistical model; text discourse structure features; text units similarity; word statistical features; Application software; Computer science; Data mining; Electronic mail; Feature extraction; Fuses; Logistics; Speech; Statistical analysis; Web sites; Multi-document automatic summarization; logistic regression model; multiple features; text units similarity computation;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on

Conference_Location :

Guangzhou, China

Print_ISBN :

0-7803-9091-1

Type :

conf

DOI :

10.1109/ICMLC.2005.1527608

Filename :

1527608

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2335013