Useful attributes identification for Unsupervised Information Extraction result set based on REAdaBoost Naïve Bayes

Author

Yin, Wenke ; Zhu, Ming

Author_Institution

Dept. of Autom., Univ. of Sci. & Technol. of China, Hefei, China

Volume

1

fYear

2010

fDate

21-24 May 2010

Abstract

Unsupervised Information Extraction has attracted considerable attention in the literature. However, its result set inevitably includes useless noise. Moreover, the proportions of useful attributes and noise in the result set are highly imbalanced, and the two types of data differ in importance. How to identify the useful attributes effectively therefore remains an open question. To address this problem, this paper proposes a revised AdaBoost algorithm, REAdaBoost. The weight coefficient of REAdaBoost is determined not only by the precision on useful attributes but also by the recall on rare attributes. We use Naïve Bayes as the base classifier and then boost it with AdaBoost and REAdaBoost separately. The experimental results show that, without increasing the overall error rate, REAdaBoost outperforms both AdaBoost and Naïve Bayes in predicting the useful attributes and the rare attributes.
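
The boosting setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the abstract does not give the REAdaBoost weight formula, so the `recall_bonus` scaling below is a hypothetical stand-in that simply increases a round's weight coefficient when the base classifier recalls more of the rare class. The function names, the binary-feature Naïve Bayes, and the smoothing constant are likewise assumptions. With `recall_bonus=0` the loop reduces to standard discrete AdaBoost.

```python
import math

def train_nb(X, y, w, smooth=1e-3):
    """Weighted Bernoulli Naive Bayes on binary features (labels 0/1)."""
    total = sum(w)
    model = {}
    for c in (0, 1):
        wc = sum(wi for wi, yi in zip(w, y) if yi == c)
        prior = (wc + smooth * total) / (total + 2 * smooth * total)
        probs = []
        for j in range(len(X[0])):
            on = sum(wi for xi, yi, wi in zip(X, y, w) if yi == c and xi[j] == 1)
            probs.append((on + smooth * total) / (wc + 2 * smooth * total))
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class (0 or 1) with the higher log-posterior score."""
    scores = {}
    for c, (prior, probs) in model.items():
        s = math.log(prior)
        for xj, p in zip(x, probs):
            s += math.log(p if xj == 1 else 1.0 - p)
        scores[c] = s
    return 1 if scores[1] > scores[0] else 0

def boost(X, y, rounds=10, rare_label=1, recall_bonus=0.0):
    """Discrete AdaBoost over Naive Bayes; returns a list of (alpha, model)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train_nb(X, y, w)
        pred = [predict_nb(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        if err >= 0.5:
            break  # base classifier is no better than chance on this weighting
        alpha = 0.5 * math.log((1 - err) / err)  # standard AdaBoost coefficient
        # ASSUMPTION: hypothetical recall-aware adjustment, not the paper's formula.
        rare_w = sum(wi for wi, yi in zip(w, y) if yi == rare_label)
        hit_w = sum(wi for wi, p, yi in zip(w, pred, y)
                    if yi == rare_label and p == rare_label)
        recall = hit_w / rare_w if rare_w > 0 else 0.0
        alpha *= 1.0 + recall_bonus * recall
        ensemble.append((alpha, model))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(alpha if p != yi else -alpha)
             for wi, p, yi in zip(w, pred, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the base classifiers, mapping labels 0/1 to -1/+1."""
    vote = sum(a * (1 if predict_nb(m, x) == 1 else -1) for a, m in ensemble)
    return 1 if vote > 0 else 0
```

Calling `boost(X, y)` and `boost(X, y, recall_bonus=0.5)` on the same imbalanced data gives a plain AdaBoost baseline and a recall-weighted variant to compare, loosely mirroring the comparison the abstract reports.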

Keywords

Bayes methods; data mining; pattern classification; AdaBoost algorithm; REAdaBoost naive Bayes; attributes identification; unsupervised information extraction; weight coefficient; 1/f noise; Automation; Background noise; Data mining; Error analysis; Explosives; Internet; Large-scale systems; Web pages; Web sites; Classification; Imbalanced Class Distributions; Information Extraction; REAdaBoost

fLanguage

English

Publisher

IEEE

Conference_Titel

Future Computer and Communication (ICFCC), 2010 2nd International Conference on

Conference_Location

Wuhan

Print_ISBN

978-1-4244-5821-9

Type

conf

DOI

10.1109/ICFCC.2010.5497739

Filename

5497739