DocumentCode :
3034868
Title :
Towards Compromising Structural and Bag of Words Approaches for Clustering Heterogeneous XML Documents
Author :
Zerida, Nadia ; Yao, Jin
Author_Institution :
GREYC Lab., Univ. of Caen - Ensicaen, Caen
fYear :
2008
fDate :
Sept. 29 2008-Oct. 4 2008
Firstpage :
69
Lastpage :
72
Abstract :
The number of unlabeled documents on the web keeps growing, and organizing related heterogeneous XML documents into clusters according to their structural and conceptual properties has become a pressing need. In this paper, we treat pre-processing as a key step for improving clustering quality and propose a new pre-processing method that combines Hapax words with path-based descriptors. A constrained agglomerative clustering method is used, and different document representations are compared. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured using micro- and macro-average purity.
Keywords :
XML; pattern clustering; text analysis; Hapax words; bag of words approach; clustering quality; conceptual properties; constrained agglomerative clustering; documents clustering; heterogeneous XML documents; macro average purity measure; micro average purity measure; path based descriptors; structural approach; structural properties; Clustering algorithms; Clustering methods; Computer applications; Fourier transforms; Information management; Information retrieval; Laboratories; Organizing; Testing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Engineering Computing and Applications in Sciences, 2008. ADVCOMP '08. The Second International Conference on
Conference_Location :
Valencia
Print_ISBN :
978-0-7695-3369-8
Electronic_ISBN :
978-0-7695-3369-8
Type :
conf
DOI :
10.1109/ADVCOMP.2008.28
Filename :
4640995