Handling Datasets in a Multi-Relational Environment: Cluster Dispersion vs Cluster Purity

Author

Alfred, Rayner ; Kazakov, Dimitar

Author_Institution

Univ. Malaysia Sabah, Kota Kinabalu

fYear

2007

fDate

6-8 Sept. 2007

Firstpage

196

Lastpage

201

Abstract

Clustering multiple instances in a multi-relational environment requires data transformations (e.g. data aggregation) that convert datasets stored in multiple tables into a single table. Unfortunately, most relational databases offer only a few basic aggregation methods (e.g. max, min, sum, count, avg) for continuous and categorical values, so data transformation is limited to these aggregations. In this paper, we propose a genetic semi-supervised clustering technique as a means of aggregating data stored in multiple tables, using it to find the best number of clusters. This algorithm is suitable for classifying datasets with a high degree of one-to-many associations, in which a single record is associated with multiple instances. The clustering algorithm can be used in two ways. One is unsupervised clustering, where the user controls the clustering result by optimizing the value of cluster dispersion. The other is semi-supervised clustering, where an unsupervised clustering method is optimized with a genetic algorithm that incorporates the Gini index, a measure of classification accuracy used in decision tree algorithms. We examine both methods for dynamically clustering multiple instances as a means of aggregating them, and illustrate the effectiveness of the semi-supervised genetic algorithm-based clustering technique.
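
The two criteria named in the abstract, cluster dispersion and cluster purity, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how dispersion is computed, so the sketch assumes mean squared distance to the cluster centroid, and it assumes purity is scored as the size-weighted Gini impurity of the class labels in each cluster. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_cluster_gini(assignments, labels):
    """Size-weighted mean Gini impurity over clusters (lower means purer)."""
    groups = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        groups[cluster_id].append(label)
    total = len(labels)
    return sum(len(g) / total * gini_impurity(g) for g in groups.values())


def within_cluster_dispersion(points, assignments):
    """Assumed dispersion measure: mean squared Euclidean distance
    from each point to the centroid of its cluster."""
    groups = defaultdict(list)
    for cluster_id, point in zip(assignments, points):
        groups[cluster_id].append(point)
    squared_sum = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        squared_sum += sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        )
    return squared_sum / len(points)


# Toy data (hypothetical): four one-dimensional instances with class labels.
points = [(1.0,), (2.0,), (9.0,), (10.0,)]
labels = ["a", "a", "b", "b"]
compact = [0, 0, 1, 1]  # clusters follow the class labels
mixed = [0, 1, 0, 1]    # each cluster mixes both classes

for name, assignment in (("compact", compact), ("mixed", mixed)):
    print(
        name,
        within_cluster_dispersion(points, assignment),
        weighted_cluster_gini(assignment, labels),
    )
```

In a genetic search of the kind the abstract describes, either score could serve as the fitness to minimise: the dispersion score needs no class labels (the unsupervised case), while the Gini score uses them (the semi-supervised case).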

Keywords

database indexing; decision trees; genetic algorithms; pattern classification; pattern clustering; relational databases; unsupervised learning; GINI index; classification algorithm; cluster dispersion; cluster purity; data aggregation; data transformation; decision tree algorithm; genetic semisupervised clustering technique; multiple-instance clustering; multirelational database environment; unsupervised clustering method; Aggregates; Clustering algorithms; Clustering methods; Conferences; Data acquisition; Decision trees; Genetic algorithms; Optimization methods; Relational databases; Supervised learning; Clustering; Data Aggregation; Data Pre-processing; Genetic Algorithm; Relational Data Mining; Semi-supervised Clustering

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2007. IDAACS 2007. 4th IEEE Workshop on

Conference_Location

Dortmund

Print_ISBN

978-1-4244-1347-8

Electronic_ISBN

978-1-4244-1348-5

Type

conf

DOI

10.1109/IDAACS.2007.4488404

Filename

4488404