Title :
A novel algorithm for technical articles classification based on gene selection
Author :
Kilany, R. ; Ammar, Reda ; Rajasekaran, Sanguthevar
Author_Institution :
Comput. Sci. & Eng. Dept., Univ. of Connecticut, Storrs, CT, USA
Abstract :
Research in science and engineering has generated voluminous datasets. For instance, biological databases such as PubMed now hold millions of articles. Given this growth in data, retrieving information relevant to a specific topic has become a major challenge. In this paper we focus on the problem of retrieving articles pertaining to a given topic from a huge collection of articles. In particular, we investigate the problem of classifying articles. Though numerous techniques and tools are available for document classification, a shortcoming of these is that they take too much time. In this paper we present generic computational techniques that can classify articles efficiently. Our algorithms are based on algorithms that have been proposed for a related problem called gene selection. Gene selection is the problem of identifying a minimum set of genes that are responsible for certain events (for example, the presence of cancer). Even though gene selection was originally proposed for biological data analysis, the technique itself is generic. For example, 'genes' can be thought of as generic variables. A typical tool that we envision will take as input a set of keywords (that characterize the information of interest) and will develop a learner that identifies a small subset of the keywords capable of classifying papers into two types. A paper is of the first type if it has information of interest and of the second type if it does not. Experiments show that the new algorithm obtains a higher classification accuracy using a smaller number of selected keywords when compared to one of the best algorithms reported in the literature.
Keywords :
data mining; information retrieval; learning (artificial intelligence); pattern classification; support vector machines; text analysis; article collection; article retrieval; data growth; document classification; gene selection; generic computational technique; generic variable; information characterization; information retrieval; keyword selection; keyword subset identification; learning; paper classification; technical article classification; text mining; voluminous dataset; Accuracy; Algorithm design and analysis; Classification algorithms; Correlation; Kernel; Support vector machines; Training; Data Mining; SVM; document classification; text categorization;
Conference_Title :
Computers and Communications (ISCC), 2012 IEEE Symposium on
Conference_Location :
Cappadocia
Print_ISBN :
978-1-4673-2712-1
Electronic_ISBN :
1530-1346
DOI :
10.1109/ISCC.2012.6249300