Mining the Web with active hidden Markov models

Author

Scheffer, Tobias ; Decomain, Christian ; Wrobel, Stefan

Author_Institution

Univ. of Magdeburg, Germany

fYear

2001

fDate

2001

Firstpage

645

Lastpage

646

Abstract

Given the enormous amounts of information available only in unstructured or semi-structured textual documents, tools for information extraction (IE) have become enormously important. IE tools identify the relevant information in such documents and convert it into a structured format such as a database or an XML document. While the first IE algorithms were hand-crafted sets of rules, researchers soon turned to learning extraction rules from hand-labeled documents. Unfortunately, rule-based approaches sometimes fail to provide the necessary robustness against the inherent variability of document structure, which has led to the recent interest in using hidden Markov models (HMMs). By using additional unlabeled documents, which are usually readily available in most applications, we can perform active learning of HMMs. The idea of active learning algorithms is to identify unlabeled observations that would be most useful when labeled by the user. Such algorithms are known for classification, clustering, and regression; we present the first algorithm for active learning of hidden Markov models.
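
The abstract does not say which criterion is used to decide which observations are "most useful when labeled". As an illustration only, the following minimal sketch assumes one common heuristic, margin-based uncertainty sampling: the forward-backward algorithm gives the posterior over hidden states for every token, and the tokens whose two most probable states are closest in probability are offered to the user first. The function names and the toy model parameters are hypothetical and are not taken from the paper.

```python
def state_posteriors(obs, start, trans, emit):
    """Forward-backward: P(state at position t | whole observation sequence)."""
    n, T = len(start), len(obs)
    # Forward pass, rescaled at every step to avoid numerical underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            row = [start[i] * emit[i][obs[0]] for i in range(n)]
        else:
            row = [emit[i][obs[t]] *
                   sum(alpha[t - 1][j] * trans[j][i] for j in range(n))
                   for i in range(n)]
        z = sum(row)
        alpha.append([v / z for v in row])
    # Backward pass, also rescaled at every step.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        row = [sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                   for j in range(n))
               for i in range(n)]
        z = sum(row)
        beta[t] = [v / z for v in row]
    # The per-position product is proportional to the posterior; normalise it.
    post = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(row)
        post.append([v / z for v in row])
    return post


def token_margins(post):
    """Gap between the most and second most probable state at each position."""
    margins = []
    for row in post:
        top = sorted(row, reverse=True)
        margins.append(top[0] - top[1])
    return margins


def select_queries(sequences, start, trans, emit, k):
    """Return the k (sequence index, position) pairs with the smallest margin."""
    scored = []
    for s, obs in enumerate(sequences):
        margins = token_margins(state_posteriors(obs, start, trans, emit))
        for t, m in enumerate(margins):
            scored.append((m, s, t))
    scored.sort()
    return [(s, t) for _, s, t in scored[:k]]


# Hypothetical toy model: 2 hidden states, 3 observation symbols.
# Symbol 1 is emitted with equal probability by both states.
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.2, 0.8]]
emit = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
unlabeled = [[0, 0, 1, 2, 2]]

queries = select_queries(unlabeled, start, trans, emit, k=1)
print(queries)
```

In this toy setting the token with the ambiguous symbol sits between a stretch that favours one state and a stretch that favours the other, so it has the smallest margin and is the first one a user would be asked to label. A real system would re-estimate the HMM after each batch of labels and repeat the selection.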

Keywords

data mining; hidden Markov models; information resources; information retrieval; learning (artificial intelligence); Web mining; active hidden Markov models; active learning; information extraction; semi-structured textual documents; unlabeled documents; unstructured textual documents; Clustering algorithms; Data mining; Databases; Hidden Markov models; Probability; Robustness; Sequences; Speech recognition; XML

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on

Conference_Location

San Jose, CA

Print_ISBN

0-7695-1119-8

Type

conf

DOI

10.1109/ICDM.2001.989591

Filename

989591