• DocumentCode
    710099
  • Title

    Scaling up copy detection

  • Author

    Xian Li; Xin Luna Dong; Kenneth B. Lyons; Weiyi Meng; Divesh Srivastava

  • Author_Institution
    Comput. Sci. Dept., Binghamton Univ., Binghamton, NY, USA
  • fYear
    2015
  • fDate
    13-17 April 2015
  • Firstpage
    89
  • Lastpage
    100
  • Abstract
    Recent research shows that copying is prevalent for Deep-Web data and that considering copying can significantly improve truth finding from conflicting values. However, existing copy detection techniques do not scale to large data sources or to large numbers of sources, so truth finding can be slowed down by one to two orders of magnitude compared with the corresponding techniques that do not consider copying. In this paper, we study how to improve the scalability of copy detection on structured data. Our algorithm builds an inverted index for each shared value and processes the index entries in decreasing order of how much the shared value can contribute to the conclusion of copying. We show how we use the index to prune the data items we consider for each pair of sources, and to incrementally refine our results in iterative copy detection. We also apply a sampling strategy with which we are able to further reduce copy-detection time while still obtaining results very similar to those on the whole data set. Experiments on various real data sets show that our algorithm can reduce the time for copy detection by two to three orders of magnitude; in other words, truth finding can benefit from copy detection with very little overhead.
  • Keywords
    Internet; copy protection; conflicting values; copy detection techniques; copy-detection time reduction; data pruning; data sources; deep-Web data; index entry processing; inverted index; iterative copy detection; sampling strategy; scalability improvement; shared value; structured data; truth finding improvement; Accuracy; Buildings; Convergence; Distributed databases; Indexes; Knowledge based systems; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Data Engineering (ICDE), 2015 IEEE 31st International Conference on
  • Conference_Location
    Seoul
  • Type

    conf

  • DOI
    10.1109/ICDE.2015.7113275
  • Filename
    7113275
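
The indexing idea described in the abstract (an inverted index over shared values, processed in decreasing order of how much each value contributes to a conclusion of copying) can be illustrated with a minimal sketch. This is not the paper's algorithm: the input layout (`claims`), the function names, the threshold, and in particular the contribution score (a log ratio that treats rarely shared values as stronger evidence) are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_inverted_index(claims):
    """Map each (item, value) to the sources providing it, keeping only shared entries."""
    index = defaultdict(set)
    for source, items in claims.items():
        for item, value in items.items():
            index[(item, value)].add(source)
    # A value provided by a single source cannot be evidence of copying.
    return {key: sorted(srcs) for key, srcs in index.items() if len(srcs) > 1}


def contribution(providers, n_sources):
    """Hypothetical score: a value shared by fewer sources counts as stronger evidence."""
    return math.log(n_sources / len(providers))


def candidate_copy_pairs(claims, threshold):
    """Accumulate per-pair evidence, visiting index entries from strongest to weakest."""
    n_sources = len(claims)
    index = build_inverted_index(claims)
    entries = sorted(
        index.values(),
        key=lambda providers: contribution(providers, n_sources),
        reverse=True,
    )
    scores = defaultdict(float)
    for providers in entries:
        weight = contribution(providers, n_sources)
        if weight <= 0:
            break  # entries are sorted, so the remaining ones add nothing
        for pair in combinations(providers, 2):
            scores[pair] += weight
    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    # Toy data (assumed layout): source -> {data item -> provided value}.
    claims = {
        "A": {"x": 1, "y": 2, "z": 3},
        "B": {"x": 1, "y": 2, "z": 9},
        "C": {"x": 1, "y": 5, "z": 3},
        "D": {"x": 1, "y": 7, "z": 8},
    }
    print(candidate_copy_pairs(claims, threshold=0.5))
```

Sorting the entries by contribution lets the loop stop as soon as the remaining entries can no longer add evidence, which is a simplified stand-in for the pruning the abstract mentions. The sampling and iterative refinement steps are not covered here.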