• DocumentCode
    922165
  • Title

    Efficient classification across multiple database relations: a CrossMine approach

  • Author

    Yin, Xiaoxin ; Han, Jiawei ; Yang, Jiong ; Yu, Philip S.

  • Author_Institution
    Dept. of Comput. Sci., Illinois Univ., Urbana, IL
  • Volume
    18
  • Issue
    6
  • fYear
    2006
  • fDate
    6/1/2006 12:00:00 AM
  • Firstpage
    770
  • Lastpage
    783
  • Abstract
    Relational databases are the most popular repository for structured data and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing inductive logic programming approaches (recently also known as relational mining) have proven effective, with high accuracy, in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in databases. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification: 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations; 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification; and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method, and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach.
  • Keywords
    data mining; decision trees; entity-relationship modelling; inductive logic programming; pattern classification; relational databases; sampling methods; CrossMine-Rule; CrossMine-Tree; decision-tree nodes; entity-relationship links; multiple database relation; multirelational classification; relational database; selective sampling method; structured data; tuple ID propagation; Buildings; Credit cards; Data analysis; Decision making; Logic programming; Relational databases; Sampling methods; Scalability; Spatial databases; Statistics; Data mining; classification
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.94
  • Filename
    1626232