DocumentCode :
2966687
Title :
Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity
Author :
Cernian, Alexandra ; Carstoiu, Dorin ; Olteanu, Adriana
Author_Institution :
Fac. of Autom. Control & Comput. Sci., Politeh. Univ. of Bucharest, Bucharest, Romania
fYear :
2008
fDate :
26-29 Sept. 2008
Firstpage :
123
Lastpage :
126
Abstract :
The expansive growth of the Internet has produced a vast quantity of data that is unstructured compared with a conventional database. Applying clustering to the World Wide Web is essential for extracting structured information from this mass of data. In this paper, we test the results of a new clustering technique, clustering by compression, when applied to heterogeneous data sets. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). To validate the results, we calculate several quality indices. If the values obtained indicate high clustering quality, we plan to include the clustering by compression technique in a framework for clustering heterogeneous Web objects.
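As a rough illustration of the distance the abstract describes, the sketch below computes the NCD from the compressed lengths of two inputs and of their concatenation, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The choice of zlib as the compressor and the function names `compressed_len` and `ncd` are assumptions made for this example; the record does not say which compressor or implementation the authors used.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the compressed lengths of each input alone and of their
    concatenation, as described in the abstract.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical inputs: one repetitive text and one unrelated byte pattern.
    text = b"the quick brown fox jumps over the lazy dog " * 50
    other = bytes(range(256)) * 10
    print("NCD(text, text): ", ncd(text, text))
    print("NCD(text, other):", ncd(text, other))
```

Values close to 0 suggest the two inputs share most of their structure, and values close to 1 suggest they share little. A clustering step would typically compute this distance for every pair of objects and pass the resulting matrix to a hierarchical clustering method.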
Keywords :
Internet; data compression; pattern clustering; World Wide Web; cluster validity; heterogeneous Web data clustering; heterogeneous Web object clustering; normalized compression distance; structured information; unstructured data; Automatic control; Clustering algorithms; Clustering methods; Collaboration; Information filtering; Information filters; Scientific computing; Testing; Web sites; clustering; heterogeneous data
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Symbolic and Numeric Algorithms for Scientific Computing, 2008. SYNASC '08. 10th International Symposium on
Conference_Location :
Timisoara
Print_ISBN :
978-0-7695-3523-4
Type :
conf
DOI :
10.1109/SYNASC.2008.64
Filename :
5204799