DocumentCode
2769348
Title
Visual Content Structures for Wrapper Induction in Building Metasearch Systems
Author
Tsay, Jyh-Jong ; Tsay, Chin-Wen ; Wang, Xin-Jie
Author_Institution
Dept. of Comput. Sci. & Inf. Eng., Nat. Chung Cheng Univ., Chiayi, Taiwan
Volume
1
fYear
2010
fDate
Aug. 31 2010-Sept. 3 2010
Firstpage
180
Lastpage
183
Abstract
As more and more online sources become available on the Web, it becomes very time-consuming, if not impossible, to visit and search all web sites one by one. Many search engines have been developed to help users find the information they need. However, search engines work poorly for online sources whose data often reside in the deep web, which is not part of the surface web indexed by standard search engines. Metasearch is a very popular mechanism for searching the deep web. It lets users search and access all of the information sources with a single query submission. One of the fundamental problems in building metasearch systems is learning wrappers that extract and integrate data records from the query result pages returned by online sources. In this paper, we develop an unsupervised approach for wrapper induction that combines visual, content, and HTML tag information. Our approach first learns a visual content model that alleviates HTML tag differences among data records, and then finds a tag model from all data records that match the visual content model. Experiments show that our approach works well for data sets collected from well-known search engines and shopping websites.
Keywords
Web sites; hypermedia markup languages; query processing; retail data processing; search engines; HTML tag information; metasearch systems; query submission; shopping Websites; unsupervised approach; visual content structures; wrapper induction; VCWI; data record extraction; metasearch system; web information extraction
fLanguage
English
Publisher
ieee
Conference_Titel
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
Conference_Location
Toronto, ON
Print_ISBN
978-1-4244-8482-9
Electronic_ISBN
978-0-7695-4191-4
Type
conf
DOI
10.1109/WI-IAT.2010.40
Filename
5616254