DocumentCode :
884552
Title :
An efficient and scalable algorithm for clustering XML documents by structure
Author :
Lian, Wang ; Cheung, David Wai-Lok ; Mamoulis, Nikos ; Yiu, Siu-Ming
Author_Institution :
Dept. of Comput. Sci. & Inf. Syst., Hong Kong Univ., China
Volume :
16
Issue :
1
fYear :
2004
Firstpage :
82
Lastpage :
96
Abstract :
With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of a structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm, which is efficient and effective compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
Keywords :
Internet; XML; data mining; graph theory; pattern clustering; query processing; S-GRACE; XML document clustering; XML standardization; cluster mining; fragmentation problem; hierarchical algorithm; information exchange language; relational tables; scalable algorithm; semistructured data; structural information; structure graph; tree-edit distance; Clustering algorithms; Computer Society; Information analysis; Inspection; Object oriented databases; Relational databases; Standardization; Tree graphs
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2004.1264824
Filename :
1264824