Title :
Protein Sequence Classification Using Feature Hashing
Author :
Caragea, Cornelia ; Silvescu, Adrian ; Mitra, Prasenjit
Author_Institution :
Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
Abstract :
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation used for protein sequence classification usually yields prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable because of the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for the performance and complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by mapping features to hash keys, so that multiple features can be mapped (at random) to the same key, and then "aggregating" the counts of the features that share a key. We compare feature hashing with the "bag of k-grams" and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
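The hashing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of CRC32 as the hash function, the example sequence, and the bin count are all assumptions chosen for the example, since the abstract does not specify them.

```python
import zlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping substrings of length k (the k-grams)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kgrams(sequence, k):
    """Baseline representation: one count per distinct k-gram."""
    return Counter(kgrams(sequence, k))


def hashed_features(sequence, k, num_bins):
    """Map each k-gram to one of num_bins hash keys and aggregate counts.

    CRC32 is an assumption made for this sketch; any deterministic hash
    function would serve. Distinct k-grams may collide in the same bin,
    in which case their counts are summed.
    """
    vector = [0] * num_bins
    for gram in kgrams(sequence, k):
        index = zlib.crc32(gram.encode("ascii")) % num_bins
        vector[index] += 1
    return vector


# Hypothetical example: a short amino-acid string, 3-grams, 8 bins.
example = "MKVLAAGIVG"
bag = bag_of_kgrams(example, 3)
hashed = hashed_features(example, 3, 8)
```

With 20 amino acids, the bag of k-grams can have up to 20^k dimensions, whereas the hashed vector always has `num_bins` entries. The total count is preserved because every k-gram lands in exactly one bin.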
Keywords :
bioinformatics; data mining; feature extraction; learning (artificial intelligence); pattern classification; proteins; bag of k-grams; dimensionality reduction; feature hashing; feature selection; k-gram representation; learning algorithm; next-generation sequencing technologies; protein sequence classification; Accuracy; Mutual information; Protein sequence; Support vector machines; Vectors; Vocabulary; variable length k-grams
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4577-1799-4
DOI :
10.1109/BIBM.2011.91