Title :
Clustering homogeneous XML documents using weighted similarities on XML attributes
Author :
Nagwani, Naresh Kumar ; Bhansali, Ashok
Author_Institution :
Dept. of CS&E, NIT, Raipur, India
Abstract :
XML (eXtensible Markup Language) has been adopted by a number of software vendors and has become the standard for data interchange over the web; it is also platform and application independent. An XML document consists of a number of attributes, such as document data, structure, and style sheet. Clustering is a method of creating groups of similar objects. In this paper, a weighted similarity measurement approach for detecting the similarity between homogeneous XML documents is suggested. Using this similarity measurement, a new clustering technique is also proposed. Methods for calculating the similarity of a document's structure and styling have been given by a number of researchers, and most are based on tree edit distances. For calculating the distance between documents' contents, there are a number of text and other similarity techniques, such as cosine, Jaccard, and tf-idf. In this paper, both kinds of similarity technique are combined to propose a new distance measurement technique for calculating the distance between a pair of homogeneous XML documents. The proposed clustering model is implemented in Java, an open source technology, and is validated experimentally. Given a collection of XML documents, the distances between documents are calculated and stored in Java collections, and these distances are then used to cluster the XML documents.
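The idea of combining a structural similarity and a content similarity into one weighted distance, and then clustering on the stored distances, can be illustrated with a minimal Java sketch. It is not the authors' implementation. Several simplifications are assumed: each document is represented by a pre-extracted set of element paths and a set of content terms; Jaccard similarity over the path sets stands in for the tree-edit-distance measures mentioned in the abstract; and a simple threshold-based leader clustering is used because the abstract does not specify the clustering procedure. The class name `WeightedXmlSimilarity`, the methods `jaccard`, `distance`, `cluster`, and `setOf`, the parameter `structureWeight`, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WeightedXmlSimilarity {

    // Convenience helper (hypothetical): build a set of strings.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Jaccard similarity: size of intersection over size of union.
    // Two empty sets are treated as identical (similarity 1.0).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Weighted distance between two documents.
    // Assumption: structure is compared via Jaccard over element paths
    // (a stand-in for tree edit distance), content via Jaccard over terms.
    // structureWeight is expected to lie in [0, 1].
    static double distance(Set<String> pathsA, Set<String> pathsB,
                           Set<String> termsA, Set<String> termsB,
                           double structureWeight) {
        double structureSim = jaccard(pathsA, pathsB);
        double contentSim = jaccard(termsA, termsB);
        double similarity = structureWeight * structureSim
                + (1.0 - structureWeight) * contentSim;
        return 1.0 - similarity;
    }

    // Leader clustering (an assumed, simple scheme): each document joins the
    // first cluster whose first member lies within the threshold distance,
    // otherwise it starts a new cluster.
    static List<List<Integer>> cluster(double[][] dist, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            boolean placed = false;
            for (List<Integer> candidate : clusters) {
                int leader = candidate.get(0);
                if (dist[leader][i] <= threshold) {
                    candidate.add(i);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Integer> fresh = new ArrayList<>();
                fresh.add(i);
                clusters.add(fresh);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Illustrative data only: element paths and content terms per document.
        List<Set<String>> paths = Arrays.asList(
                setOf("/book", "/book/title", "/book/author"),
                setOf("/book", "/book/title", "/book/author"),
                setOf("/book", "/book/title", "/book/price"));
        List<Set<String>> terms = Arrays.asList(
                setOf("xml", "clustering", "similarity"),
                setOf("xml", "clustering", "distance"),
                setOf("cooking", "recipes", "pasta"));

        // Pairwise distances are computed once and kept in a matrix.
        int n = paths.size();
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                dist[i][j] = distance(paths.get(i), paths.get(j),
                        terms.get(i), terms.get(j), 0.5);
            }
        }
        System.out.println(cluster(dist, 0.4));
    }
}
```

The weight controls how much the structural comparison counts relative to the content comparison; for homogeneous documents that share much of their structure, a lower structural weight lets content differences dominate the distance.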
Keywords :
XML; document handling; pattern clustering; XML attributes; XML clustering; extensible markup language; homogeneous XML documents; similarity measurement; weighted similarities; Application software; Clustering algorithms; Distance measurement; Information retrieval; Java; Software measurement; Software standards; Software testing; Weight measurement; Weighted Similarity; XML Documents Similarity
Conference_Title :
Advance Computing Conference (IACC), 2010 IEEE 2nd International
Conference_Location :
Patiala
Print_ISBN :
978-1-4244-4790-9
Electronic_ISBN :
978-1-4244-4791-6
DOI :
10.1109/IADCC.2010.5422926