• DocumentCode
    1993492
  • Title
    Automatic discovery of semantic structures in HTML documents
  • Author
    Mukherjee, Saikat ; Yang, Guizhen ; Tan, Wenfang ; Ramakrishnan, I.V.
  • Author_Institution
    Dept. of Comput. Sci., State Univ. of New York, Stony Brook, NY, USA
  • fYear
    2003
  • fDate
    3-6 Aug. 2003
  • Firstpage
    245
  • Abstract
    Template-driven HTML documents possess an implicit, fixed schema denoting concepts and their relationships in a hierarchical fashion. Discovering this schema remains a relatively unexplored problem. By exploiting a key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm for automatically partitioning them into tree-like semantic structures which expose the implicit schema.
  • Keywords
    Web sites; hypermedia markup languages; text analysis; tree data structures; HTML document; automatic partitioning algorithm; automatic semantic structure discovery; spatial locality; tree-like structure; HTML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
  • Print_ISBN
    0-7695-1960-1
  • Type
    conf
  • DOI
    10.1109/ICDAR.2003.1227667
  • Filename
    1227667