DocumentCode
3034868
Title
Towards Compromising Structural and Bag of Words Approaches for Clustering Heterogeneous XML Documents
Author
Zerida, Nadia ; Yao, Jin
Author_Institution
GREYC Lab., Univ. of Caen - Ensicaen, Caen
fYear
2008
fDate
Sept. 29 2008-Oct. 4 2008
Firstpage
69
Lastpage
72
Abstract
As the quantity of unlabeled documents on the web grows, organizing related heterogeneous XML documents into clusters by using their structural and conceptual properties has become a great need. In this paper, we consider pre-processing a key step for improving clustering quality, and we propose a new pre-processing method based on combining Hapax words and path-based descriptors. A constrained agglomerative clustering method is used, and a comparison between different document representations is performed. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured using micro- and macro-average purity measures.
Keywords
XML; pattern clustering; text analysis; Hapax words; bag of words approach; clustering quality; conceptual properties; constrained agglomerative clustering; documents clustering; heterogeneous XML documents; macro average purity measure; micro average purity measure; path based descriptors; structural approach; structural properties; Clustering algorithms; Clustering methods; Computer applications; Fourier transforms; Information management; Information retrieval; Laboratories; Organizing; Testing
fLanguage
English
Publisher
IEEE
Conference_Titel
Advanced Engineering Computing and Applications in Sciences, 2008. ADVCOMP '08. The Second International Conference on
Conference_Location
Valencia
Print_ISBN
978-0-7695-3369-8
Electronic_ISBN
978-0-7695-3369-8
Type
conf
DOI
10.1109/ADVCOMP.2008.28
Filename
4640995