Title :
Hybrid text mining model for document classification
Author :
Vidhya, K.A. ; Aghila, G.
Author_Institution :
Dept. of Comput. Sci., Pondicherry Univ., Pondicherry, India
Abstract :
This work proposes a hybrid model for text document classification in information retrieval that combines Naive Bayes and rough set theory. Rough set theory is used for feature reduction, and the Naive Bayes theorem is used to classify documents into predefined categories by means of probabilistic values. The proposed model is planned to be deployed through an enhanced use of the Naive Bayes approach and rough set theory to overcome imprecision and vagueness in the data set, thus improving classification accuracy. In the Naive Bayes model, the word probabilities for a class are estimated by calculating the likelihood over the entire set of training documents, where the data are split randomly into k subsets, for example 2/3 for training and 1/3 for testing. In addition, the model uses a two-level hierarchy structure for training documents, drawing features from the title, keywords and content together with the predefined knowledge available. The rough set model includes a feature reduction technique that reduces the number of features used for classification, aiming at an optimal classification of text documents.
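The Naive Bayes step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one (Laplace) smoothing, and the fixed random seed are assumptions made for illustration, and the rough set feature reduction stage is not shown.

```python
import math
import random
from collections import Counter, defaultdict


def split_two_thirds(docs, seed=0):
    """Randomly split labelled documents into 2/3 training and 1/3 test."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)  # seed is an illustrative assumption
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def train_naive_bayes(train_docs):
    """Collect class counts, per-class word counts, and the vocabulary
    from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in train_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return the class with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log prior of the class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                # add-one smoothing (an assumption, not stated in the abstract)
                likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In a pipeline like the one the abstract outlines, a feature reduction step would filter the tokens before they reach `train_naive_bayes`, so that only the retained features contribute to the word probabilities.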
Keywords :
Bayes methods; classification; data mining; information retrieval; probability; rough set theory; text analysis; feature reduction; hybrid text mining model; information retrieval; naive Bayes; probabilistic value; rough set theory; text document classification; training document; word probability; Computer science; Data mining; Information retrieval; Machine learning; Machine learning algorithms; Probability; Rough sets; Set theory; Testing; Text mining; Feature Reduction; Feature Selection; Naïve Bayes; Rough sets; Text Mining;
Conference_Title :
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-5585-0
Electronic_ISBN :
978-1-4244-5586-7
DOI :
10.1109/ICCAE.2010.5451965