• DocumentCode
    480705
  • Title

    Discriminating Meaningful Web Tables from Decorative Tables Using a Composite Kernel

  • Author

    Son, Jeong-Woo ; Lee, Jae-An ; Park, Seong-Bae ; Song, Hyun-Je ; Lee, Sang-Jo ; Park, Se-Young

  • Author_Institution
    Dept. of Comput. Eng., Kyungpook Nat. Univ., Daegu
  • Volume
    1
  • fYear
    2008
  • fDate
    9-12 Dec. 2008
  • Firstpage
    368
  • Lastpage
    371
  • Abstract
    Information extraction from the World Wide Web has received considerable attention. Since a table is a well-organized, summarized expression of knowledge about a domain, extracting information from tables is of great importance. However, many tables in web pages are used not to convey information but to decorate the pages. Discriminating meaningful tables from decorative ones is therefore one of the most critical tasks in web table mining. The main obstacle in this task is the difficulty of generating relevant features for the discrimination. This paper proposes a novel method that discriminates the two kinds of table using a composite kernel, which combines a parse tree kernel and a linear kernel. Since an HTML parser represents a web table as a parse tree, the parse tree kernel is a natural way to measure the similarity between tables, and the linear kernel over content features compensates for the weak points of the parse tree kernel. Support vector machines with the composite kernel distinguish meaningful tables from decorative ones with high accuracy. A series of experiments shows that the proposed method achieves state-of-the-art performance.
  • Keywords
    Internet; hypermedia markup languages; information retrieval; program compilers; trees (mathematics); HTML parser; Web tables; World Wide Web; composite kernel; decorative tables; information extraction; parse tree; Intelligent agent; Kernel; Composite Kernel; Machine Learning; Web Table Discrimination; Web data mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-0-7695-3496-1
  • Type

    conf

  • DOI
    10.1109/WIIAT.2008.241
  • Filename
    4740474
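
The composite kernel described in the abstract can be illustrated with a minimal sketch. The record does not give the paper's actual formulation, so everything below is an assumption made for illustration: trees are encoded as (tag, [children]) tuples, the parse tree kernel is a Collins-Duffy style count of shared subtrees, the two kernels are combined as a weighted sum, and the names `decay` and `alpha` are hypothetical parameters.

```python
from math import sqrt

# Assumed encoding: a parsed HTML node is (tag, [child nodes]).

def nodes(tree):
    """Yield every node of the tree, root first."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def production(node):
    """A node's tag together with the ordered tags of its children."""
    return (node[0], tuple(child[0] for child in node[1]))

def common_subtrees(n1, n2, decay):
    """Weighted count of subtree fragments rooted at both n1 and n2."""
    if not n1[1] or not n2[1]:
        return 0.0  # leaf nodes have no production to share
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + common_subtrees(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=0.5):
    """Sum the shared-fragment counts over all pairs of nodes."""
    return sum(common_subtrees(a, b, decay)
               for a in nodes(t1) for b in nodes(t2))

def normalized_tree_kernel(t1, t2, decay=0.5):
    """Scale the tree kernel so that identical trees score 1.0."""
    denom = sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / denom if denom else 0.0

def linear_kernel(x, y):
    """Dot product of two content-feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def composite_kernel(sample1, sample2, alpha=0.5):
    """Assumed combination: weighted sum of tree and linear kernels.

    Each sample is a (tree, feature_vector) pair.
    """
    (t1, x1), (t2, x2) = sample1, sample2
    return (alpha * normalized_tree_kernel(t1, t2)
            + (1.0 - alpha) * linear_kernel(x1, x2))

# Two toy tables: one with two cells per row, one with a single cell.
data_table = ("table", [("tr", [("td", []), ("td", [])])])
layout_table = ("table", [("tr", [("td", [])])])
```

A kernel of this shape could be evaluated over all pairs of training samples to build a Gram matrix for an SVM implementation that accepts precomputed kernels. The weight `alpha` would have to be tuned on held-out data.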