Title :
Scalable Rule-Based Gene Expression Data Classification
Author :
Iwen, Mark A. ; Lang, Willis ; Patel, Jignesh M.
Author_Institution :
Dept. of Math., Univ. of Michigan, Ann Arbor, MI
Abstract :
Current state-of-the-art association rule-based classifiers for gene expression data operate in two phases: (i) Association rule mining from training data followed by (ii) Classification of query data using the mined rules. In the worst case, these methods require an exponential search over the subset space of the training data set's samples and/or genes during at least one of these two phases. Hence, existing association rule-based techniques are prohibitively computationally expensive on large gene expression datasets. Our main result is the development of a heuristic rule-based gene expression data classifier called Boolean Structure Table Classification (BSTC). BSTC is explicitly related to association rule-based methods, but is guaranteed to be polynomial space/time. Extensive cross validation studies on several real gene expression datasets demonstrate that BSTC retains the classification accuracy of current association rule-based methods while being orders of magnitude faster than the leading classifier RCBT on large datasets. As a result, BSTC is able to finish table generation and classification on large datasets for which current association rule-based methods become computationally infeasible. BSTC also enjoys two other advantages over association rule-based classifiers: (i) BSTC is easy to use (requires no parameter tuning), and (ii) BSTC can easily handle datasets with any number of class types. Furthermore, in the process of developing BSTC we introduce a novel class of Boolean association rules which have potential applications to other data mining problems.
Keywords :
Boolean functions; biology computing; data mining; pattern classification; Boolean association rules; Boolean structure table classification; association rule mining; association rule-based classifiers; association rule-based techniques; gene expression data classifier; gene expression datasets; polynomial space-time; query data; scalable rule-based gene expression data classification; Association rules; Cancer; Classification tree analysis; Costs; Gene expression; Mathematics; Support vector machine classification; Support vector machines; Training data
Conference_Title :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
DOI :
10.1109/ICDE.2008.4497515