DocumentCode :
2539473
Title :
A new text feature extraction model and its application in document copy detection
Author :
Bao, Jun-peng ; Shen, Jun-Yi ; Liu, Xiao-dong ; Song, Qin-Bao
Author_Institution :
Dept. of Comput. Sci. & Eng., Xi'an Jiaotong Univ., China
Volume :
1
fYear :
2003
fDate :
2-5 Nov. 2003
Firstpage :
82
Abstract :
Text feature extraction is a common issue in information retrieval, text mining, Web mining, text classification/clustering, document copy detection, etc. The most popular approach is the word-frequency-based scheme, which uses a word frequency vector to represent a document. The cosine function, dot product and proportion function are the regular similarity measures for such vectors. But a word frequency vector captures only the global semantic feature of a document and loses local features and structural information, which prevents us from distinguishing texts well, especially in copy detection. In this paper we present a new text feature extraction model, the semantic sequence model (SSM), which is based on the concepts of word distance, word density and semantic sequence. The semantic sequences of a document contain not only local semantic features but also global features and structural information, with which we obtain excellent accuracy in text copy detection. At the end of the paper, we contrast SSM with VSM and RFM, and the experimental results show that SSM is a superior model.
Keywords :
feature extraction; information retrieval; probability; text analysis; vectors; cosine function; document copy detection; dot product; global semantic feature; information retrieval; local semantic features; plagiarism probability; proportion function; semantic sequence model; text copy detection; text feature extraction model; word density; word distance; word frequency based scheme; word frequency vector; Computer science; Electronic mail; Feature extraction; Fingerprint recognition; Frequency; Information retrieval; Plagiarism; Text categorization; Text mining; Web mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2003 International Conference on
Print_ISBN :
0-7803-8131-9
Type :
conf
DOI :
10.1109/ICMLC.2003.1264447
Filename :
1264447