• DocumentCode
    2736868
  • Title

    Automatic Hypertext Table Understanding by using Logical Structure Description Algorithm

  • Author

    Huang, Chiung-Wei ; Chien, Chih-Yuan ; Hsu, Chun-Nan ; Lee, Hahn-Ming

  • Author_Institution
    Nat. Taiwan Univ. of Sci. & Technol., Taipei
  • fYear
    2007
  • fDate
    5-7 Sept. 2007
  • Firstpage
    247
  • Lastpage
    247
  • Abstract
    Because they focus on template matching, conventional approaches are limited by complex and varied layout structures. This paper proposes a novel and efficient logical structure description algorithm, named the structure description algorithm, to automatically extract logical structures from hypertext (Web) tables. Based on table field relationships, our approach starts from each data cell and searches leftward and upward for its correlated headers. Rules describing the logical structure can then be generated without defining the layout structure pattern in advance. In addition, with the help of a table translation strategy, our method outputs a relational table that can be fed directly into an SQL database for information query and processing. Experimental results show that the proposed method not only retains the logical structure in the output relational table, but also outperforms two major methods in handling very complex Web tables.
  • Keywords
    Internet; SQL; query processing; relational databases; SQL database; Web tables; automatic hypertext table understanding; information query; logical structure description algorithm; structure description algorithm; template matching; Computer science; Data mining; HTML; Hydrogen; Information processing; Information retrieval; Pattern matching; Production; Relational databases; Service oriented architecture;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovative Computing, Information and Control, 2007. ICICIC '07. Second International Conference on
  • Conference_Location
    Kumamoto
  • Print_ISBN
    0-7695-2882-1
  • Type

    conf

  • DOI
    10.1109/ICICIC.2007.190
  • Filename
    4427892
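
  • Illustrative sketch (not part of the original record)
    The abstract describes starting from each data cell and searching leftward and upward for its correlated headers, then emitting a relational table. The following is a minimal sketch of that idea only, not the authors' algorithm. It assumes the table has already been parsed into a rectangular grid of strings and that the number of header rows and header columns is known. The function name `describe_cells`, its parameters, and the toy grid are hypothetical.

```python
def describe_cells(grid, n_header_rows, n_header_cols):
    """Return one (row_headers, col_headers, value) record per data cell.

    Assumptions (hypothetical, not from the paper): `grid` is a rectangular
    list of lists of strings, the first `n_header_rows` rows and the first
    `n_header_cols` columns hold headers, and empty strings are blank cells.
    """
    records = []
    for r in range(n_header_rows, len(grid)):
        for c in range(n_header_cols, len(grid[r])):
            # Search leftward in the same row for non-empty row headers.
            row_headers = [
                grid[r][cc]
                for cc in range(min(c, n_header_cols) - 1, -1, -1)
                if grid[r][cc]
            ]
            row_headers.reverse()  # restore outermost-to-innermost order

            # Search upward in the same column for non-empty column headers.
            col_headers = [
                grid[rr][c]
                for rr in range(min(r, n_header_rows) - 1, -1, -1)
                if grid[rr][c]
            ]
            col_headers.reverse()

            records.append((tuple(row_headers), tuple(col_headers), grid[r][c]))
    return records


# Toy table invented for illustration only.
grid = [
    ["",     "2006", "2007"],
    ["Taipei",   "10", "12"],
    ["Kumamoto", "7",  "9"],
]

for row_headers, col_headers, value in describe_cells(grid, 1, 1):
    print(row_headers, col_headers, value)
```

    Each record pairs a data value with the headers found to its left and above it, so the list of records is already in a flat, relational shape that could be loaded into a database table. Handling of spanning cells (rowspan/colspan) and header detection is left out of this sketch.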