Title :
CentralMatch: A Fast and Accurate Method to Identify Blog-Duplicates
Author :
Park, Heejin ; Lee, Sang-Chul ; Lee, Soon-Haeng ; Kim, Sang-Wook
Author_Institution :
Dept. of Electron. & Comput. Eng., Hanyang Univ., Seoul, South Korea
fDate :
Aug. 31 2010-Sept. 3 2010
Abstract :
A group of documents is called near-duplicates if they are almost identical, with only slight differences. Since near-duplicates are a major concern for Web search engines, it is necessary to identify and filter them effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts are located in the two documents. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% differ in the middle. Thus, blog-duplicates share a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, to identify blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precisions of MinHashing and CentralMatch are fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
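The observation that blog-duplicates share a long matched sequence in their central parts can be illustrated with a small sketch. This is not the CentralMatch algorithm itself, which the abstract does not specify. It is only a minimal illustration, assuming whitespace tokenization and a hypothetical threshold `min_fraction`, that uses Python's standard `difflib` to measure the longest contiguous token run two documents share. The function names are likewise assumptions made for this example.

```python
from difflib import SequenceMatcher


def longest_shared_run(a_tokens, b_tokens):
    """Return the length, in tokens, of the longest contiguous run found in both lists."""
    # autojunk=False disables difflib's popularity heuristic so every token is compared.
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(a_tokens), 0, len(b_tokens))
    return match.size


def looks_like_blog_duplicate(doc_a, doc_b, min_fraction=0.8):
    """Hypothetical check: the shared run covers at least min_fraction of the shorter document.

    min_fraction is an illustrative threshold, not a value taken from the paper.
    """
    a_tokens, b_tokens = doc_a.split(), doc_b.split()
    shorter = min(len(a_tokens), len(b_tokens))
    if shorter == 0:
        return False
    return longest_shared_run(a_tokens, b_tokens) / shorter >= min_fraction
```

Under these assumptions, a repost that only adds a header and a footer keeps the whole original as one shared run and passes the check, while an edit in the middle splits the shared text into shorter runs and fails it.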
Keywords :
Internet; document handling; indexing; search engines; string matching; CentralMatch; MinHashing; Web search engines; blog-duplicate identification; document database; near-duplicate identification methods; Blog posts; Duplicate identification
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
Conference_Location :
Toronto, ON
Print_ISBN :
978-1-4244-8482-9
Electronic_ISBN :
978-0-7695-4191-4
DOI :
10.1109/WI-IAT.2010.98