• DocumentCode
    2335943
  • Title

    Mining the Web with active hidden Markov models

  • Author

    Scheffer, Tobias ; Decomain, Christian ; Wrobel, Stefan

  • Author_Institution
    Univ. of Magdeburg, Germany
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    645
  • Lastpage
    646
  • Abstract
    Given the enormous amounts of information available only in unstructured or semi-structured textual documents, tools for information extraction (IE) have become enormously important. IE tools identify the relevant information in such documents and convert it into a structured format such as a database or an XML document. While the first IE algorithms were hand-crafted sets of rules, researchers soon turned to learning extraction rules from hand-labeled documents. Unfortunately, rule-based approaches sometimes fail to provide the necessary robustness against the inherent variability of document structure, which has led to the recent interest in using hidden Markov models (HMMs). By using additional unlabeled documents, which are usually readily available in most applications, we can perform active learning of HMMs. The idea of active learning algorithms is to identify unlabeled observations that would be most useful when labeled by the user. Such algorithms are known for classification, clustering, and regression; we present the first algorithm for active learning of hidden Markov models.
  • Keywords
    data mining; hidden Markov models; information resources; information retrieval; learning (artificial intelligence); Web mining; active hidden Markov models; active learning; information extraction; semi-structured textual documents; unlabeled documents; unstructured textual documents; Clustering algorithms; Data mining; Databases; Hidden Markov models; Probability; Robustness; Sequences; Speech recognition; Tin; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    0-7695-1119-8
  • Type

    conf

  • DOI
    10.1109/ICDM.2001.989591
  • Filename
    989591