• DocumentCode
    2290791
  • Title

    On the resemblance and containment of documents

  • Author

    Broder, Andrei Z.

  • Author_Institution
    Digital Systems Res. Center, Palo Alto, CA, USA
  • fYear
    1997
  • fDate
    11-13 Jun 1997
  • Firstpage
    21
  • Lastpage
    29
  • Abstract
    Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of “roughly the same” and “roughly contained.” The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints
  • Keywords
    information retrieval; random processes; set theory; Rabin fingerprints; World Wide Web; containment; documents; fixed size sample; informal notions; information retrieval; intersection problems; mathematical notions; mathematical properties; random sampling; resemblance; roughly contained; roughly the same; Algorithm design and analysis; Clustering algorithms; Costs; Digital systems; Explosions; Fingerprint recognition; Particle measurements; Sampling methods; Testing; Web sites;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Compression and Complexity of Sequences 1997. Proceedings
  • Conference_Location
    Salerno
  • Print_ISBN
    0-8186-8132-2
  • Type

    conf

  • DOI
    10.1109/SEQUEN.1997.666900
  • Filename
    666900