Regional Information Center for Science and Technology (RICEST) - Quick asymmetric text similarity measures

DocumentCode :

2540464

Title :

Quick asymmetric text similarity measures

Author :

Bao, Jun-peng ; Shen, Jun-yi ; Liu, Xiao-dong ; Liu, Hai-yan

Author_Institution :

Dept. of Comput. Sci. & Eng., Xi'an Jiaotong Univ., China

Volume :

fYear :

2003

fDate :

2-5 Nov. 2003

Firstpage :

374

Abstract :

Text similarity measurement is a common problem in information retrieval, text mining, Web mining, text classification/clustering, and document copy detection. The most popular approach is the word-frequency-based scheme, which represents a document as a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures on such vectors. However, they are symmetric measures and therefore cannot detect subset copies. In this paper we present the concepts of the asymmetric similarity model and the heavy frequency vector (HFV). The former detects subset copies well; the latter saves considerable resources and CPU time. We develop two new asymmetric measures, HFM and HIPM, which are derived from the cosine function and the proportion function by combining the asymmetric similarity concept with the HFV. The HFV truncates the original full frequency vector to a short vector, and its parameter can be adjusted to balance the model's performance. Several experiments illustrate aspects of the asymmetric similarity and HFV models.
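
The two ideas in the abstract can be illustrated with a minimal sketch. The record does not give the definitions of HFM or HIPM, so the code below is not the authors' method. It assumes that the HFV keeps the k most frequent words, and it uses a generic inclusion-style score as a stand-in asymmetric measure. All function names and the parameter k are hypothetical.

```python
from collections import Counter
import math


def heavy_frequency_vector(text, k):
    """Keep only the k most frequent words of a document.

    Assumption: this is one plausible reading of "truncating the full
    frequency vector"; the paper's exact rule is not given in the record.
    """
    counts = Counter(text.lower().split())
    return dict(counts.most_common(k))


def cosine(a, b):
    """Standard symmetric cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def inclusion(a, b):
    """Illustrative asymmetric score: how much of a is covered by b.

    Not HFM or HIPM. Each word of a is credited up to its own count,
    so the score stays in [0, 1] and equals 1 when b contains all of a.
    """
    self_dot = sum(v * v for v in a.values())
    if not self_dot:
        return 0.0
    covered = sum(a[w] * min(a[w], b.get(w, 0)) for w in a)
    return covered / self_dot


small = heavy_frequency_vector("copy detection text copy", k=5)
large = heavy_frequency_vector(
    "copy detection text copy plus many other unrelated words", k=5
)

print(cosine(small, large), cosine(large, small))  # same value both ways
print(inclusion(small, large))  # 1.0: small is fully contained in large
print(inclusion(large, small))  # 0.75: large is only partly covered by small
```

The symmetric cosine gives one number for the pair, whereas the inclusion score differs by direction, which is what allows a subset copy to be flagged. A smaller k shortens the vectors and reduces the work per comparison at the cost of discarding low-frequency words.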

Keywords :

information retrieval; text analysis; word processing; Web mining; asymmetric similarity model; cosine function; document copy detection; dot product; heavy frequency vector; information retrieval; proportion function; text classification; text clustering; text mining; text similarity measure; word frequency based scheme; word frequency vector; Computer science; Frequency; Functional analysis; Indexing; Information retrieval; Large scale integration; Statistics; Text categorization; Text mining; Web mining;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics, 2003 International Conference on

Print_ISBN :

0-7803-8131-9

Type :

conf

DOI :

10.1109/ICMLC.2003.1264505

Filename :

1264505

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2540464