• DocumentCode
    2769348
  • Title
    Visual Content Structures for Wrapper Induction in Building Metasearch Systems
  • Author
    Tsay, Jyh-Jong ; Tsay, Chin-Wen ; Wang, Xin-Jie

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Chung Cheng Univ., Chiayi, Taiwan
  • Volume
    1
  • fYear
    2010
  • fDate
    Aug. 31 2010-Sept. 3 2010
  • Firstpage
    180
  • Lastpage
    183
  • Abstract
    As more and more online sources become available on the Web, it becomes very time-consuming, if not impossible, to visit and search all web sites one by one. Many search engines have been developed to help users find the information they need. However, search engines work poorly for online sources whose data often reside in the deep web, which is not part of the surface web indexed by standard search engines. Metasearch is a very popular mechanism for searching the deep web. It lets users search and access all of the information sources with a single query submission. One of the fundamental problems in building metasearch systems is learning wrappers that extract and integrate data records from the query result pages returned by online sources. In this paper, we develop an unsupervised approach for wrapper induction that combines visual, content, and HTML tag information. Our approach first learns a visual content model that alleviates HTML tag differences among data records, and then finds a tag model from all data records that match the visual content model. Experiments show that our approach works well for data sets collected from well-known search engines and shopping websites.
  • Keywords
    Web sites; hypermedia markup languages; query processing; retail data processing; search engines; HTML tag information; metasearch systems; query submission; shopping Websites; unsupervised approach; visual content structures; wrapper induction; VCWI; data record extraction; metasearch system; web information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
  • Conference_Location
    Toronto, ON
  • Print_ISBN
    978-1-4244-8482-9
  • Electronic_ISBN
    978-0-7695-4191-4
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2010.40
  • Filename
    5616254