Title :
Weighted shingling: an adaptation of shingling for weighted shingles
Author :
Gharghe, Zahra Eskandari ; Bidgoli, Behrouz Minaei
Author_Institution :
Iran Univ. of Sci. & Technol., Tehran, Iran
Abstract :
Broder's shingling is one of the state-of-the-art approaches to detecting near-duplicate documents. Prior evaluations of this method have shown that its main source of error is document pairs whose main content differs but which share a large amount of similar, unimportant detail. Different web pages from the same site are a good example of such documents. Such pages almost always contain similar boilerplate text, which may be selected as the document's fingerprint and mislead the algorithm. This problem appears to arise because each document is represented only by a sample of its shingles. The sample contains only some of the page's shingles and discards all other information. By including additional information in this sample, such as the frequencies of shingles, we can improve the performance of the algorithm. This paper proposes a weighting of shingles and adapts shingling so that it can be applied to weighted shingles. Our results show an improvement in shingling's performance.
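The following is a minimal Python sketch of the idea, not the authors' method: the abstract does not specify the weighting scheme. It assumes that a shingle's weight is its raw frequency in the document, and it compares full shingle multisets rather than the sampled fingerprints that Broder's algorithm uses. The function names (shingles, resemblance, weighted_resemblance) and the window size w are hypothetical choices made for illustration.

```python
from collections import Counter


def shingles(text, w=3):
    """Return a Counter mapping each w-word shingle to its frequency.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))


def resemblance(a, b):
    """Unweighted resemblance: Jaccard coefficient of the two shingle sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def weighted_resemblance(a, b):
    """Frequency-weighted Jaccard: sum of minima over sum of maxima.

    Assumption: the weight of a shingle is its count in the document.
    A Counter returns 0 for a shingle that a document does not contain.
    """
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 0.0


if __name__ == "__main__":
    page_a = shingles("a b c a b c a b c", w=2)
    page_b = shingles("a b c", w=2)
    print(resemblance(page_a, page_b))           # set-based score
    print(weighted_resemblance(page_a, page_b))  # frequency-aware score
```

In this toy pair the set-based score is high because both documents share the same distinct shingles, while the frequency-aware score is lower because the shingle counts differ. That is the kind of additional information the abstract suggests keeping.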
Keywords :
Web sites; text analysis; Web pages; boilerplate text; near-duplicate document detection; weighted shingling; Fingerprint recognition; Frequency; Sampling methods; Search engines
Conference_Title :
Innovations in Information Technology, 2009. IIT '09. International Conference on
Conference_Location :
Al Ain
Print_ISBN :
978-1-4244-5698-7
DOI :
10.1109/IIT.2009.5413370