• DocumentCode
    476082
  • Title
    An unsupervised learning framework for discovering the site-specific ontology from multiple Web pages
  • Author
    Wong, Tak-Lam ; Chow, Kai-on ; Wang, Fu Lee
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Chinese Univ. of Hong Kong, Hong Kong
  • Volume
    3
  • fYear
    2008
  • fDate
    12-15 July 2008
  • Firstpage
    1598
  • Lastpage
    1603
  • Abstract
    We develop an unsupervised learning framework for automatically discovering the site-specific ontology from multiple pages of a Web site. To handle the uncertainty involved, the framework is built on a generative model of the text fragments contained in the pages of a Web site. One characteristic of the framework is that it considers clues from multiple pages collected from the Web site. Another is that it learns the regularities of the layout format to discover the site-specific ontology via stochastic grammatical inference. To accomplish ontology discovery, the ontology information blocks of a Web page are identified using site-invariant information. We have conducted extensive experiments on real-world Web sites, and comparisons with existing methods demonstrate the effectiveness of the framework.
  • Keywords
    Internet; Web sites; data mining; ontologies (artificial intelligence); stochastic processes; unsupervised learning; multiple Web pages; site invariant information; site-specific ontology discovery; stochastic grammatical inference; Computer science; Cybernetics; Humans; Intelligent agent; Machine learning; Ontologies; Semantic Web; Web pages; Ontology; Text mining; Web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2008 International Conference on
  • Conference_Location
    Kunming
  • Print_ISBN
    978-1-4244-2095-7
  • Electronic_ISBN
    978-1-4244-2096-4
  • Type
    conf
  • DOI
    10.1109/ICMLC.2008.4620661
  • Filename
    4620661