DocumentCode
2539473
Title
A new text feature extraction model and its application in document copy detection
Author
Bao, Jun-peng ; Shen, Jun-Yi ; Liu, Xiao-dong ; Song, Qin-Bao
Author_Institution
Dept. of Comput. Sci. & Eng., Xi'an Jiaotong Univ., China
Volume
1
fYear
2003
fDate
2-5 Nov. 2003
Firstpage
82
Abstract
Text feature extraction is a common problem in information retrieval, text mining, Web mining, text classification/clustering, and document copy detection. The most popular approach is the word-frequency-based scheme, which represents a document as a word frequency vector. The cosine function, dot product, and proportion function are the usual similarity measures for such vectors. However, a word frequency vector captures only the global semantic features of a document and loses local features and structural information, which makes it hard to distinguish texts well, especially in copy detection. In this paper we present a new text feature extraction model, the semantic sequence model (SSM), which is based on the concepts of word distance, word density, and semantic sequence. The semantic sequences of a document contain not only local semantic features but also global features and structural information, and on this basis we obtain excellent accuracy in text copy detection. At the end of the paper, we compare SSM with VSM and RFM; the experimental results show that SSM is a superior model.
Keywords
feature extraction; information retrieval; probability; text analysis; vectors; cosine function; document copy detection; dot product; global semantic feature; information retrieval; local semantic features; plagiarism probability; proportion function; semantic sequence model; text copy detection; text feature extraction model; word density; word distance; word frequency based scheme; word frequency vector; Computer science; Electronic mail; Feature extraction; Fingerprint recognition; Frequency; Information retrieval; Plagiarism; Text categorization; Text mining; Web mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2003 International Conference on
Print_ISBN
0-7803-8131-9
Type
conf
DOI
10.1109/ICMLC.2003.1264447
Filename
1264447