Title :
Removing DUST Using Multiple Alignment of Sequences
Author :
Rodrigues, Kaio ; Cristo, Marco ; de Moura, Edleno S. ; da Silva, Altigran
Author_Institution :
Inst. of Comput. Sci., Fed. Univ. of Amazonas, Manaus, Brazil
Abstract :
A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate content. Crawling, storing, and using such duplicated data wastes resources, degrades ranking quality, and harms the user experience. To deal with this problem, several methods have been proposed to detect and remove duplicate documents without fetching their contents. To accomplish this, these methods learn normalization rules that transform all duplicate URLs into the same canonical form. A challenging aspect of this strategy is deriving a set of rules that is both general and precise. In this work, we present DUSTER, a new approach that derives high-quality rules by taking advantage of a multi-sequence alignment strategy. We demonstrate that performing a full multi-sequence alignment of URLs with duplicated content, before generating the rules, leads to very effective rules. In our evaluation, the method achieved larger reductions in the number of duplicate URLs than the best baseline, with gains of 82 and 140.74 percent in two different web collections.
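For intuition only, the sketch below illustrates the general idea described in the abstract: tokenize a cluster of URLs known to point to the same content, align the token sequences, and collapse the regions that vary into a wildcard pattern that acts as a crude normalization rule. This is not the authors' DUSTER algorithm; the alignment here uses Python's difflib as a simplified stand-in for a full multi-sequence alignment, and the example URLs and rule format are invented for illustration.

```python
# Illustrative sketch (NOT the DUSTER implementation from the paper):
# derive a wildcard pattern from a cluster of duplicate URLs by aligning
# their token sequences and marking the variable regions.
import difflib
import re

def tokenize(url):
    # Split a URL into tokens, keeping the delimiters (/, ?, &, =) as tokens.
    return [t for t in re.split(r'([/?&=])', url) if t]

def align_pair(a, b):
    """Align two token sequences; emit '*' where the sequences differ."""
    merged = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == 'equal':
            merged.extend(a[i1:i2])
        elif not merged or merged[-1] != '*':
            merged.append('*')  # variable region -> single wildcard
    return merged

def derive_pattern(urls):
    """Fold all duplicate URLs in a cluster into one wildcard pattern."""
    pattern = tokenize(urls[0])
    for url in urls[1:]:
        pattern = align_pair(pattern, tokenize(url))
    return ''.join(pattern)

if __name__ == '__main__':
    # Hypothetical cluster of URLs that serve the same page.
    dup_cluster = [
        'http://example.com/story?id=42&sid=a1',
        'http://example.com/story?id=42&sid=b7',
        'http://example.com/story?id=42',
    ]
    print(derive_pattern(dup_cluster))
    # -> 'http://example.com/story?id=42*'  (the session-id part is irrelevant)
```

In the paper's setting, patterns like this would be turned into rewrite rules applied at crawl time, so duplicate URLs are normalized to a canonical form without fetching their contents; the pairwise folding above is only a rough approximation of the full multi-sequence alignment the abstract refers to.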
Keywords :
Internet; data mining; information retrieval; DUSTER; URL; Web collections; Web crawlers; content fetching; duplicate document removal; multiple alignment; multisequence alignment strategy; near-duplicate contents; ranking quality; Algorithm design and analysis; Crawlers; Noise; Search engines; Training; Transforms; Uniform resource locators; Web technology; web crawling and normalization rules
Journal_Title :
IEEE Transactions on Knowledge and Data Engineering
DOI :
10.1109/TKDE.2015.2407354