DocumentCode :
2710199
Title :
Learning the kernel matrix for XML document clustering
Author :
Yang, Jianwu ; Cheung, William K. ; Chen, Xiaoou
Author_Institution :
Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
fYear :
2005
fDate :
29 March-1 April 2005
Firstpage :
353
Lastpage :
358
Abstract :
The rapid growth of XML adoption has created a need for a proper representation of semi-structured documents, one that takes document structural information into account so as to support more precise document analysis. In this paper, an XML document representation named "structured link vector model" is adopted, with a kernel matrix included for modeling the similarity between XML elements. Our formulation allows individual XML elements to have their own weighted contribution to the overall document similarity while also allowing the between-element similarity to be captured. An iterative algorithm is derived to learn the kernel matrix. For performance evaluation, the ACM SIGMOD Record dataset and the CEDE dataset have been tested. Our proposed method significantly outperforms the traditional vector space model and the edit-distance based methods. In addition, the kernel matrix obtained as a by-product provides knowledge about the conceptual relationship between the XML elements.
Keywords :
XML; data mining; learning (artificial intelligence); pattern clustering; ACM SIGMOD record dataset; CEDE dataset; XML document clustering; document structural information; edit-distance based method; iterative algorithm; kernel matrix; performance evaluation; semistructured documents; structured link vector model; Computer science; Fourier transforms; Information analysis; Iterative algorithms; Kernel; Testing; Text analysis; Training data; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
e-Technology, e-Commerce and e-Service, 2005. EEE '05. Proceedings. The 2005 IEEE International Conference on
Print_ISBN :
0-7695-2274-2
Type :
conf
DOI :
10.1109/EEE.2005.87
Filename :
1402321