Title :
Fast Induction of Multiple Decision Trees in Text Categorization from Large Scale, Imbalanced, and Multi-label Data
Author :
Vateekul, Peerapon ; Kubat, Miroslav
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Miami, Coral Gables, FL, USA
Abstract :
The paper focuses on automated categorization of text documents, each labeled with one or more classes and described by tens of thousands of features. The computational cost of induction in such domains is high enough to almost rule out decision trees, so reducing it is an important research issue. Our own solution, FDT ("fast decision-tree induction"), uses a two-pronged strategy: (1) feature-set pre-selection, and (2) induction of several trees, each from a different data subset, whose results are then combined by a data-fusion technique tailored to domains with imbalanced classes.
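The two-pronged strategy can be illustrated with a minimal Python sketch. It is not the authors' FDT implementation: the abstract names neither the feature-selection criterion nor the fusion rule, so information gain, depth-1 trees (stumps) standing in for full tree induction, interleaved data subsets, and a thresholded vote are all assumptions made here for illustration. Every function name is hypothetical. The sketch handles one binary label; treating a multi-label problem as one such problem per class is likewise an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels (0.0 for an empty list)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(X, y, f):
    """Information gain of binary feature f with respect to labels y."""
    on = [lab for row, lab in zip(X, y) if row[f]]
    off = [lab for row, lab in zip(X, y) if not row[f]]
    n = len(y)
    return entropy(y) - len(on) / n * entropy(on) - len(off) / n * entropy(off)

def preselect(X, y, k):
    """Prong 1 (assumed criterion): keep the k features with the highest gain."""
    ranked = sorted(range(len(X[0])), key=lambda f: info_gain(X, y, f), reverse=True)
    return ranked[:k]

def majority(labels, default):
    """Most common label, or default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(X, y, features):
    """Depth-1 tree on the best pre-selected feature (stand-in for a full tree)."""
    f = max(features, key=lambda g: info_gain(X, y, g))
    default = majority(y, 0)
    on = majority([lab for row, lab in zip(X, y) if row[f]], default)
    off = majority([lab for row, lab in zip(X, y) if not row[f]], default)
    return f, on, off

def predict_stump(stump, row):
    f, on, off = stump
    return on if row[f] else off

def train_ensemble(X, y, k, n_subsets):
    """Prong 2: one tree per data subset, all restricted to the pre-selected features."""
    features = preselect(X, y, k)
    stumps = []
    for i in range(n_subsets):
        idx = range(i, len(X), n_subsets)  # interleaved subsets (an assumption)
        stumps.append(train_stump([X[j] for j in idx], [y[j] for j in idx], features))
    return stumps

def predict_ensemble(stumps, row, threshold=0.5):
    """Assumed fusion rule: predict 1 when the share of positive votes reaches threshold."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return int(votes / len(stumps) >= threshold)
```

In this sketch, lowering `threshold` below 0.5 lets a minority of trees trigger a positive prediction, which is one simple way to favor a rare class; the data-fusion technique described in the paper may work differently.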
Keywords :
decision trees; text analysis; data fusion technique; fast decision-tree induction; feature-set pre-selection; imbalanced classes; large-scale data; multi-label data; multiple decision trees; text categorization; Cloud computing; Clustering algorithms; Computer networks; Costs; Data mining; Data processing; Large-scale systems; Machine learning algorithms
Conference_Title :
Data Mining Workshops, 2009. ICDMW '09. IEEE International Conference on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-5384-9
Electronic_ISBN :
978-0-7695-3902-7
DOI :
10.1109/ICDMW.2009.94