Regional Information Center for Science and Technology - Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts

DocumentCode :

2983874

Title :

Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts

Author :

Lipka, N. ; Stein, B. ; Shanahan, J.G.

Author_Institution :

Bauhaus-Univ. Weimar, Weimar, Germany

fYear :

2012

fDate :

10-13 Dec. 2012

Firstpage :

972

Lastpage :

977

Abstract :

Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is not only confronted with an information flood but also with an increase in "information volatility", by which we understand the fact that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among other places, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is solely based on the initial test sample and that can be computed without resampling.
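
The resampling procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of accuracy as the effectiveness measure, the uniform draw of cluster weights from the simplex, and the one-sided bound `mean - z * std` are all assumptions made here for illustration. The input is a list of per-document correctness flags and a cluster label for each test document, as any clustering of the test sample would produce.

```python
import random
import statistics


def sample_cluster_weights(k, rng):
    """Draw one distribution over k clusters, uniform on the simplex (assumption)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def resampled_accuracies(correct, clusters, n_rounds=500, sample_size=200, seed=0):
    """Resample the test set under random cluster distributions; return accuracies."""
    rng = random.Random(seed)
    labels = sorted(set(clusters))
    # Group the per-document correctness flags by cluster label.
    by_cluster = {c: [ok for ok, cl in zip(correct, clusters) if cl == c] for c in labels}
    accuracies = []
    for _ in range(n_rounds):
        weights = sample_cluster_weights(len(labels), rng)
        # Pick a cluster for each resampled document, then a document from it.
        chosen = rng.choices(labels, weights=weights, k=sample_size)
        hits = sum(rng.choice(by_cluster[c]) for c in chosen)
        accuracies.append(hits / sample_size)
    return accuracies


def mean_cluster_accuracy(correct, clusters):
    """Unweighted mean of the per-cluster accuracies on the original test sample."""
    labels = sorted(set(clusters))
    per_cluster = [
        statistics.fmean(ok for ok, cl in zip(correct, clusters) if cl == c)
        for c in labels
    ]
    return statistics.fmean(per_cluster)


def lower_bound(accuracies, z=1.645):
    """One-sided bound assuming normality; z=1.645 is roughly the 95% level."""
    return statistics.fmean(accuracies) - z * statistics.stdev(accuracies)


# Toy data (hypothetical): three clusters with different classifier accuracy.
toy_rng = random.Random(1)
clusters = [i % 3 for i in range(300)]
rates = {0: 0.9, 1: 0.7, 2: 0.5}
correct = [toy_rng.random() < rates[c] for c in clusters]

accs = resampled_accuracies(correct, clusters)
print("expected accuracy:", statistics.fmean(accs))
print("mean cluster accuracy:", mean_cluster_accuracy(correct, clusters))
print("lower bound:", lower_bound(accs))
```

Under the uniform weighting assumed in this sketch, every cluster has the same expected weight, so the average resampled accuracy should land close to the unweighted mean of the per-cluster accuracies. Comparing two classifiers by `lower_bound` rather than by plain test accuracy favors the one whose effectiveness varies less across clusters.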

Keywords :

pattern classification; pattern clustering; sampling methods; statistical distributions; text analysis; expected effectiveness estimation; information flood; information overload; information volatility; learning technology; margin distribution; model selection; probabilistic lower bound; repetitive resampling; subclass distribution shift; text classification solution; topic popularity; Clustering algorithms; Estimation; Machine learning; Mathematical model; Media; Standards; Vectors; Classification; Concept Drift; Model Selection; clustering; unknown distributions;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Data Mining (ICDM), 2012 IEEE 12th International Conference on

Conference_Location :

Brussels

ISSN :

1550-4786

Print_ISBN :

978-1-4673-4649-8

Type :

conf

DOI :

10.1109/ICDM.2012.89

Filename :

6413823

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2983874