Title :
A genetic algorithm based optimal feature selection for Web page classification
Author_Institution :
Department of Computer Engineering, Ç
fDate :
6/1/2011 12:00:00 AM
Abstract :
In this study, we propose a genetic algorithm that selects the best features for the Web page classification problem in order to improve the accuracy and run-time performance of classifiers. The growth in the amount of information on the Web has created a need for accurate automated Web page classifiers, both to maintain Web directories and to improve search engines' performance. To determine whether a Web page belongs to a specific class (e.g., a graduate student homepage or a course page), a classifier needs "good" features extracted from the Web pages. Because every component of a Web page, such as an HTML tag or a term, can be taken as a feature, the dimension of the classification problem becomes too high to be handled by well-known classifiers such as decision trees and support vector machines. To reduce the feature space, we developed a genetic algorithm that determines the best features for a given set of Web pages. We found that when the features selected by our genetic algorithm are used with a kNN classifier, accuracy improves to as high as 96%.
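The abstract does not give the chromosome encoding, genetic operators, or fitness function, so the following is only a minimal sketch of genetic-algorithm feature selection scored by a kNN classifier. It assumes a bit-mask chromosome (one bit per feature), truncation selection, one-point crossover, bit-flip mutation, and validation accuracy as fitness. All names, parameter values, and the toy data are hypothetical and are not taken from the paper.

```python
import random

def knn_accuracy(train, test, mask, k=3):
    """Accuracy of a kNN classifier that uses only the features whose mask bit is 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for x, y in test:
        # k closest training points by squared Euclidean distance on selected features
        nearest = sorted(train, key=lambda t: sum((t[0][i] - x[i]) ** 2 for i in idx))[:k]
        votes = [label for _, label in nearest]
        if max(set(votes), key=votes.count) == y:
            correct += 1
    return correct / len(test)

def select_features(train, valid, n_features, pop_size=20, generations=15,
                    p_mut=0.05, seed=0):
    """Assumed GA: bit-mask chromosomes, truncation selection, one-point crossover."""
    rng = random.Random(seed)
    fitness = lambda mask: knn_accuracy(train, valid, mask)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # keep the fitter half as parents
        children = [ranked[0][:]]              # elitism: carry the best mask over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover position
            child = a[:cut] + b[cut:]
            # flip each bit independently with probability p_mut
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best)

def make_toy_data(n, n_features, rng):
    """Hypothetical data: only feature 0 carries the class signal, the rest is noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.random() for _ in range(n_features)]
        x[0] = label + rng.gauss(0, 0.1)
        data.append((x, label))
    return data

if __name__ == "__main__":
    rng = random.Random(42)
    train, valid = make_toy_data(60, 6, rng), make_toy_data(30, 6, rng)
    mask, acc = select_features(train, valid, n_features=6)
    print("selected feature mask:", mask, "validation accuracy:", acc)
```

For real Web pages, the feature vectors would come from HTML tags and terms as the abstract describes, and the fitness could also penalize the number of selected features to favor shorter run times; both choices are left open here.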
Keywords :
"Web pages","Biological cells","Genetic algorithms","Accuracy","HTML","Feature extraction","Training"
Conference_Title :
Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on
Print_ISBN :
978-1-61284-919-5
DOI :
10.1109/INISTA.2011.5946076