Title :
Feature vector construction combining structure and content for document classification
Author :
Chagheri, S. ; Calabretto, Sylvie ; Roussey, C. ; Dumoulin, Cedric
Author_Institution :
INSA de Lyon, Univ. de LYON, Lyon, France
Abstract :
This paper describes a representation of XML documents for the purpose of classification. Document classification depends on document representation techniques: the more relevant the representation, the more relevant the classification. We propose a representation model that exploits both the logical structure and the content of a document. Structure is represented by the tags of the XML document. Our approach is based on the vector space model: a document is represented by a vector of weighted features, where each feature is a (tag: term) pair. We modify the tf*idf formula to compute each feature's weight according to the term's structural level in the document. An SVM is used as the learning algorithm. Experiments on the Reuters collection show that our proposal improves classification performance compared to the standard classification model based on term vectors.
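The abstract does not give the modified tf*idf formula, so the following Python sketch is only an illustration of the general idea, not the authors' method. It builds (tag, term) features from XML text and scales a standard tf*idf weight by an assumed structural factor of 1/depth, so that terms closer to the root weigh more. The function names `tag_term_counts` and `weighted_vectors`, the 1/depth factor, and the smoothed idf are all hypothetical choices made for this example. Text that follows a child element (tail text) is ignored for brevity.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_term_counts(xml_text):
    """Count (tag, term) features and record the depth of each feature."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    depth_of = {}

    def walk(node, depth):
        for term in (node.text or "").lower().split():
            feature = (node.tag, term)
            counts[feature] += 1
            # Keep the shallowest depth at which this feature occurs.
            depth_of[feature] = min(depth, depth_of.get(feature, depth))
        for child in node:
            walk(child, depth + 1)

    walk(root, 1)
    return counts, depth_of


def weighted_vectors(xml_docs):
    """Return one {(tag, term): weight} dict per document."""
    parsed = [tag_term_counts(doc) for doc in xml_docs]
    n_docs = len(parsed)

    # Document frequency of each (tag, term) feature.
    doc_freq = Counter()
    for counts, _ in parsed:
        doc_freq.update(counts.keys())

    vectors = []
    for counts, depth_of in parsed:
        vec = {}
        for feature, tf in counts.items():
            idf = math.log(n_docs / doc_freq[feature]) + 1.0
            # Assumed structural factor: 1 / depth (not from the paper).
            vec[feature] = tf * idf / depth_of[feature]
        vectors.append(vec)
    return vectors
```

The resulting sparse dictionaries could then be converted to a matrix and passed to any SVM implementation; that step is left out here because the abstract does not specify the kernel or toolkit used.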
Keywords :
XML; document handling; learning (artificial intelligence); pattern classification; support vector machines; Reuters collection; SVM; XML documents; document classification; document representation techniques; feature vector construction; learning algorithm; logical structure; term vector; tf*idf formula; vector space model; weighted features; Classification algorithms; Feature extraction; Kernel; Support vector machine classification; Vectors; structured document; support vector machine
Conference_Title :
Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), 2012 6th International Conference on
Conference_Location :
Sousse
Print_ISBN :
978-1-4673-1657-6
DOI :
10.1109/SETIT.2012.6482041