DocumentCode
2621303
Title
Detecting similar HTML documents using a fuzzy set information retrieval approach
Author
Yerra, Rajiv ; Ng, Yiu-Kai
Author_Institution
Dept. of Comput. Sci., Brigham Young Univ., Provo, UT, USA
Volume
2
fYear
2005
fDate
25-27 July 2005
Firstpage
693
Abstract
Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering for unique information slower and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a new approach for detecting similar Web documents, especially HTML documents. Our detection approach determines the odds ratio of any two documents, which makes use of the degrees of resemblance of the documents, and graphically displays the locations of similar (not necessarily the same) sentences detected in the documents after (i) eliminating non-representative words in the sentences using stopword-removal and stemming algorithms, (ii) computing the degree of similarity of sentences using a fuzzy set information retrieval approach, and (iii) matching the corresponding hierarchical content of the two documents using a simple tree matching algorithm. The proposed method for detecting similar documents handles a wide range of Web pages of varying size and does not require static word lists; it is thus applicable to different Web (especially HTML) documents in different subject areas, such as sports, news, and science.
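The abstract gives no formulas, so the following Python sketch is only an illustration of steps (i) and (ii), not the authors' implementation. It assumes a commonly used fuzzy-set information retrieval formulation: a word-to-word correlation c(i, j) = n_ij / (n_i + n_j - n_ij) computed from co-occurrence counts, and a word's degree of membership in a sentence defined as 1 minus the product of (1 - c) over that sentence's words. The stopword set, the `crude_stem` suffix stripper, and all function names are hypothetical stand-ins chosen for brevity. Step (iii), tree matching, and the odds ratio are not covered.

```python
import re

# Hypothetical, tiny stopword set used only for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "to"}


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Step (i): lowercase, tokenize, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_correlation(corpus):
    """Return c(wi, wj) = n_ij / (n_i + n_j - n_ij) over the given sentences."""
    docs = [set(preprocess(s)) for s in corpus]

    def corr(wi, wj):
        if wi == wj:
            return 1.0
        n_i = sum(wi in d for d in docs)
        n_j = sum(wj in d for d in docs)
        n_ij = sum(wi in d and wj in d for d in docs)
        denom = n_i + n_j - n_ij
        return n_ij / denom if denom else 0.0

    return corr


def membership(word, sentence_words, corr):
    """Fuzzy degree to which `word` belongs to a sentence (assumed formula)."""
    product = 1.0
    for other in sentence_words:
        product *= 1.0 - corr(word, other)
    return 1.0 - product


def sentence_similarity(s1, s2, corr):
    """Step (ii): average membership of the words of s1 in s2."""
    w1, w2 = preprocess(s1), preprocess(s2)
    if not w1:
        return 0.0
    return sum(membership(w, w2, corr) for w in w1) / len(w1)


if __name__ == "__main__":
    corpus = ["Cats chase mice", "Dogs chase cats"]
    corr = build_correlation(corpus)
    print(sentence_similarity("Cats chase mice", "Dogs chase cats", corr))
```

Note that this similarity is asymmetric (it measures how well the first sentence is covered by the second), so a symmetric comparison would need to evaluate it in both directions.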
Keywords
Internet; fuzzy set theory; hypermedia markup languages; information retrieval; trees (mathematics); HTML document; Web document; Web information retrieval; Web page; copy detection; fuzzy set information retrieval; fuzzy set model; hierarchical content; stemming algorithm; stopword-removal algorithm; tree matching algorithm; Computer displays; Content based retrieval; Degradation; Fuzzy sets; HTML; Information filtering; Information filters; Web pages; odds ratio
fLanguage
English
Publisher
IEEE
Conference_Titel
Granular Computing, 2005 IEEE International Conference on
Print_ISBN
0-7803-9017-2
Type
conf
DOI
10.1109/GRC.2005.1547380
Filename
1547380