• DocumentCode
    2731795
  • Title

    Collaborative Wrapping: A Turbo Framework for Web Data Extraction

  • Author

    Chuang, Shui-Lung ; Chang, Kevin Chen-Chuan ; Zhai, ChengXiang

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Illinois, Urbana-Champaign, IL
  • fYear
    2007
  • fDate
    15-20 April 2007
  • Firstpage
    1261
  • Lastpage
    1262
  • Abstract
    To access data sources on the Web, a crucial step is wrapping, which translates query responses, rendered in textual HTML, back into their relational form. Traditionally, this problem has been addressed with syntax-based approaches for a single source. However, as online databases multiply, we often need to wrap multiple sources, in particular for domain-based integration. Observing that sources in the same domain usually share common fields, we propose a novel wrapping concept - collaborative wrapping - where multiple sources are extracted concurrently with content-based synchronization to produce consentaneous extractions. Toward this concept, recognizing wrapping as a communication process, we develop the turbo wrapper, upon the insight of turbo codes - a multi-code decoding scheme in information theory. Our experiment shows that the turbo wrapper consistently outperforms baseline single-source methods, is robust, and does benefit from extended scales of source collaboration.
  • Keywords
    Internet; data integrity; hypermedia markup languages; query processing; Web data extraction; collaborative wrapping; content-based synchronization; domain-based integration; information theory; multicode decoding scheme; online databases; query responses; textual HTML; turbo codes; turbo wrapper; Art; Collaborative work; Computer science; Concrete; Data mining; HTML; Online Communities/Technical Collaboration; Relational databases; Turbo codes; Wrapping
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
  • Conference_Location
    Istanbul
  • Print_ISBN
    1-4244-0802-4
  • Electronic_ISBN
    1-4244-0803-2
  • Type

    conf

  • DOI
    10.1109/ICDE.2007.368988
  • Filename
    4221778