CrossMine: efficient classification across multiple database relations

Author

Yin, Xiaoxin ; Han, Jiawei ; Yang, Jiong ; Yu, Philip S.

fYear

2004

fDate

30 March-2 April 2004

Firstpage

399

Lastpage

410

Abstract

Most of today's structured data is stored in relational databases. Such a database consists of multiple relations that are linked conceptually via entity-relationship links in the design of the relational database schema. Multirelational classification can be used in many disciplines, such as financial decision-making, medical research, and geographical applications. However, most classification approaches work only on single "flat" data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing a huge, undesirable "universal relation" or losing essential information. Previous work using inductive logic programming approaches (recently also known as relational mining) has proven effective, achieving high accuracy in multirelational classification. Unfortunately, these approaches scale poorly with respect to the number of relations and the number of attributes in the database. We propose CrossMine, an efficient and scalable approach for multirelational classification. CrossMine develops several novel methods, including (1) tuple ID propagation, which performs a semantics-preserving virtual join to achieve high efficiency on databases with complex schemas, and (2) a selective sampling method, which makes it highly scalable with respect to the number of tuples in the databases. Both the theoretical background and the implementation techniques of CrossMine are introduced. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.
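
The idea behind tuple ID propagation can be illustrated with a minimal sketch. This is not the authors' implementation: the relations `loans` and `accounts`, their attributes, and the helper names `propagate_ids` and `count_labels` are all hypothetical, chosen only to show how the IDs of target tuples might be attached to a non-target relation so that a predicate on that relation can be scored without materializing the join.

```python
from collections import defaultdict

# Hypothetical target relation: loan_id -> (account_id, class label).
loans = {
    1: ("A", True),
    2: ("A", False),
    3: ("B", True),
    4: ("C", False),
}

# Hypothetical non-target relation: account_id -> attributes.
accounts = {
    "A": {"frequency": "monthly"},
    "B": {"frequency": "weekly"},
    "C": {"frequency": "monthly"},
}


def propagate_ids(target, key_index=0):
    """Map each join-key value to the set of target tuple IDs that join with it."""
    ids_by_key = defaultdict(set)
    for tid, row in target.items():
        ids_by_key[row[key_index]].add(tid)
    return ids_by_key


def count_labels(ids_by_key, other, predicate, target):
    """Count positive and negative target tuples joinable with a tuple satisfying predicate."""
    covered = set()
    for key, attrs in other.items():
        if predicate(attrs):
            # Use the propagated IDs instead of joining the two relations.
            covered |= ids_by_key.get(key, set())
    pos = sum(1 for tid in covered if target[tid][1])
    return pos, len(covered) - pos


ids_by_key = propagate_ids(loans)
pos, neg = count_labels(
    ids_by_key, accounts, lambda a: a["frequency"] == "monthly", loans
)
print(pos, neg)
```

In this sketch the propagated ID sets stand in for the join result: class counts for a candidate predicate are read off the IDs attached to each account, and the joined relation is never built.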

Keywords

data mining; entity-relationship modelling; pattern classification; relational databases; sampling methods; CrossMine; entity-relationship model; machine learning; multiple database relation classification; real database; relational database; sampling method; synthetic database; tuple ID propagation; Credit cards; Decision making; Erbium; Logic programming; Neural networks; Relational databases; Sampling methods; Scalability; Support vector machine classification; Support vector machines

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Engineering, 2004. Proceedings. 20th International Conference on

ISSN

1063-6382

Print_ISBN

0-7695-2065-0

Type

conf

DOI

10.1109/ICDE.2004.1320014

Filename

1320014