• DocumentCode
    1654295
  • Title

    A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

  • Author

    Bravo-Marquez, Felipe ; L'Huillier, Gaston ; Ríos, Sebastián A. ; Velasquez, Juan David

  • Author_Institution
    Dept. of Ind. Eng., Univ. of Chile, Santiago, Chile
  • Volume
    1
  • fYear
    2011
  • Firstpage
    146
  • Lastpage
    153
  • Abstract
    The retrieval of similar documents from the Web using documents as input instead of key-term queries is not currently supported by traditional Web search engines. One approach to solving the problem consists of fingerprinting the document's content into a set of queries that are submitted to a list of Web search engines. Afterward, the results are merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting results from multiple Web servers can take a significant amount of time and effort. In this work, a similarity function between the given document and the retrieved results is estimated. The function uses as variables features drawn from the information provided by search engine results records, such as rankings, titles, and snippets, thereby avoiding the bottleneck of requesting external Web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the content of the input and that of the results as the target variable. The execution times of the exact and approximated solutions were compared. The approximated solution reduced computational time by 86% at an acceptable level of precision with respect to the exact solution of the Web document retrieval problem.
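    The abstract names cosine similarity between the input document and the fetched result content as the target variable. A minimal sketch of that measure follows, assuming raw term-frequency vectors and a simple lowercase alphanumeric tokenizer. The record does not state the tokenization or term weighting actually used, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (assumed tokenizer, not from the paper)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query_doc = "meta search engine based on document fingerprints"
    result_doc = "a meta search engine that uses document fingerprints"
    print(cosine_similarity(query_doc, result_doc))
```

    In the exact solution described above, a score like this would be computed against the full fetched content of each result. The approximated solution instead estimates it from rankings, titles, and snippets, so no result page has to be fetched.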
  • Keywords
    document handling; file servers; information retrieval; records management; search engines; text analysis; URL; Web document retrieval; Web search engines; Web server; document fingerprint; search result record; text comparison algorithm; text similarity meta search engine; Engines; Feature extraction; Fingerprint recognition; Metasearch; Search engines; Web search; Web servers; Document Fingerprinting; Meta-Search Engine; Query Generation; Ranking Fusion; Similar Document Retrieval;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-1-4577-1373-6
  • Electronic_ISBN
    978-0-7695-4513-4
  • Type

    conf

  • DOI
    10.1109/WI-IAT.2011.27
  • Filename
    6040511