Title :
Collaborative Wrapping: A Turbo Framework for Web Data Extraction
Author :
Chuang, Shui-Lung ; Chang, Kevin Chen-Chuan ; Zhai, ChengXiang
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois, Urbana-Champaign, IL
Abstract :
To access data sources on the Web, a crucial step is wrapping, which translates query responses, rendered in textual HTML, back into their relational form. Traditionally, this problem has been addressed with syntax-based approaches for a single source. However, as online databases multiply, we often need to wrap multiple sources, in particular for domain-based integration. Observing that sources in the same domain usually share common fields, we propose a novel wrapping concept - collaborative wrapping - where multiple sources are extracted concurrently with content-based synchronization to produce consentaneous extractions. Toward this concept, recognizing wrapping as a communication process, we develop the turbo wrapper, upon the insight of turbo codes - a multi-code decoding scheme in information theory. Our experiment shows that the turbo wrapper consistently outperforms baseline single-source methods, is robust, and does benefit from extended scales of source collaboration.
Keywords :
Internet; data integrity; hypermedia markup languages; query processing; Web data extraction; collaborative wrapping; content-based synchronization; domain-based integration; information theory; multicode decoding scheme; online databases; query responses; textual HTML; turbo codes; turbo wrapper; Art; Collaborative work; Computer science; Concrete; Data mining; HTML; Online Communities/Technical Collaboration; Relational databases; Turbo codes; Wrapping;
Conference_Title :
Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Conference_Location :
Istanbul
Print_ISBN :
1-4244-0802-4
Electronic_ISBN :
1-4244-0803-2
DOI :
10.1109/ICDE.2007.368988