• DocumentCode
    3473144
  • Title

    Weighted shingling: an adaptation of shingling for weighted shingles

  • Author

    Gharghe, Zahra Eskandari ; Bidgoli, Behrouz Minaei

  • Author_Institution
    Iran Univ. of Sci. & Technol., Tehran, Iran
  • fYear
    2009
  • fDate
    15-17 Dec. 2009
  • Firstpage
    150
  • Lastpage
    154
  • Abstract
    Broder's shingling is one of the state-of-the-art approaches to detecting near-duplicate documents. Prior evaluations of this method have shown that document pairs which have different main content but share a large amount of similar unimportant details are the main source of its errors. Different web pages from the same site are a good example of such documents. Such pages almost always contain similar boilerplate text, which has a chance of being selected as the document's fingerprint and tricking the algorithm. It seems that this problem is due to representing each document only by a sample of its shingles. This sample contains only some of the page's shingles and discards any other information. By including additional information, such as the frequencies of shingles, in this sample, we can improve the performance of the algorithm. This paper proposes a weighting of shingles and adapts shingling to be applied to weighted shingles. Our results have shown an improvement in shingling's performance.
  • Keywords
    Web sites; text analysis; Web pages; boilerplate text; near-duplicate document detection; weighted shingling; Fingerprint recognition; Frequency; Sampling methods; Search engines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovations in Information Technology, 2009. IIT '09. International Conference on
  • Conference_Location
    Al Ain
  • Print_ISBN
    978-1-4244-5698-7
  • Type

    conf

  • DOI
    10.1109/IIT.2009.5413370
  • Filename
    5413370
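
  • Illustration
    The abstract contrasts set-based shingling with a variant that also uses shingle frequencies. The sketch below is only a rough illustration of that contrast, not the authors' method: the abstract does not specify the weighting scheme or the sampling step, so the code assumes raw shingle counts as weights, compares documents with the generalized (weighted) Jaccard measure, and computes exact values without any fingerprint sampling. The function names and the shingle length k are hypothetical.

```python
from collections import Counter


def shingles(text, k=3):
    """Count the word k-grams (shingles) in text. k=3 is an assumed default."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def resemblance(a, b):
    """Set-based resemblance: Jaccard coefficient over distinct shingles."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def weighted_resemblance(a, b):
    """Generalized Jaccard: sum of minimum counts over sum of maximum counts.

    Assumption: shingle weight = raw frequency; the paper may weight differently.
    """
    keys = set(a) | set(b)
    num = sum(min(a[s], b[s]) for s in keys)  # Counter returns 0 for absent keys
    den = sum(max(a[s], b[s]) for s in keys)
    return num / den if den else 0.0


if __name__ == "__main__":
    doc_a = shingles("a b c a b c a b c", k=2)
    doc_b = shingles("a b c", k=2)
    print(resemblance(doc_a, doc_b))
    print(weighted_resemblance(doc_a, doc_b))
```

    In this toy pair the two documents share most of their distinct shingles, so the set-based score is high, while the frequency-aware score is lower because the repeated shingles of the first document are not matched in the second.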