DocumentCode :
2983874
Title :
Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts
Author :
Lipka, N. ; Stein, B. ; Shanahan, J.G.
Author_Institution :
Bauhaus-Univ. Weimar, Weimar, Germany
fYear :
2012
fDate :
10-13 Dec. 2012
Firstpage :
972
Lastpage :
977
Abstract :
Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is confronted not only with an information flood but also with an increase in "information volatility", by which we mean that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among others, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is based solely on the initial test sample and that can be computed without resampling.
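The resampling procedure outlined in the abstract might look roughly like the following minimal sketch. It is not the authors' implementation: the function names, the use of accuracy as the effectiveness measure, the flat Dirichlet prior over cluster proportions, and the one-sided bound of the form mean minus z times standard deviation are all assumptions made for illustration. The abstract does not specify any of them.

```python
import random
import statistics
from collections import defaultdict


def resampled_effectiveness(cluster_labels, correct, rounds=1000,
                            sample_size=200, seed=0):
    """Return one accuracy value per resampling round.

    cluster_labels: cluster id of each test document (from any clustering).
    correct: True/False per test document, whether the classifier was right.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for label, ok in zip(cluster_labels, correct):
        by_cluster[label].append(1.0 if ok else 0.0)
    clusters = sorted(by_cluster)

    scores = []
    for _ in range(rounds):
        # Assumption: cluster proportions follow a flat Dirichlet,
        # drawn here as normalized Gamma(1, 1) variates.
        raw = [rng.gammavariate(1.0, 1.0) for _ in clusters]
        total = sum(raw)
        weights = [r / total for r in raw]
        # Resample documents: pick a cluster by weight, then a document
        # uniformly (with replacement) inside that cluster.
        chosen = rng.choices(clusters, weights=weights, k=sample_size)
        hits = [rng.choice(by_cluster[c]) for c in chosen]
        scores.append(sum(hits) / sample_size)
    return scores


def probabilistic_lower_bound(scores, z=1.645):
    """Assumed bound: mean - z * std, treating scores as roughly normal."""
    return statistics.mean(scores) - z * statistics.stdev(scores)


def mean_over_clusters(cluster_labels, correct):
    """Unweighted mean of per-cluster accuracy; needs no resampling."""
    by_cluster = defaultdict(list)
    for label, ok in zip(cluster_labels, correct):
        by_cluster[label].append(1.0 if ok else 0.0)
    return statistics.mean(statistics.mean(v) for v in by_cluster.values())
```

Under the symmetric prior assumed here, every cluster has the same expected proportion, so the mean of the resampled scores should land close to `mean_over_clusters`. For model selection one could compare candidate classifiers by `probabilistic_lower_bound` rather than by accuracy on the original test sample alone.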
Keywords :
pattern classification; pattern clustering; sampling methods; statistical distributions; text analysis; expected effectiveness estimation; information flood; information overload; information volatility; learning technology; margin distribution; model selection; probabilistic lower bound; repetitive resampling; subclass distribution shift; text classification solution; topic popularity; Clustering algorithms; Estimation; Machine learning; Mathematical model; Media; Standards; Vectors; Classification; Concept Drift; Model Selection; clustering; unknown distributions;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining (ICDM), 2012 IEEE 12th International Conference on
Conference_Location :
Brussels
ISSN :
1550-4786
Print_ISBN :
978-1-4673-4649-8
Type :
conf
DOI :
10.1109/ICDM.2012.89
Filename :
6413823