• DocumentCode
    3321750
  • Title

    Validating Multi-column Schema Matchings by Type

  • Author

    Dai, Bing Tian ; Koudas, Nick ; Srivastava, Divesh ; Tung, Anthony K H ; Venkatasubramanian, Suresh

  • Author_Institution
    Nat. Univ. of Singapore, Singapore
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    120
  • Lastpage
    129
  • Abstract
    Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have similar "semantic type", and that this can be captured using distributions over values ("statistical types"), we develop a method for validating 1-1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for "splitting" the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
  • Keywords
    database management systems; information theory; program verification; type theory; database integration; geometric discrepancy; information-theoretic discrepancy; multi-column schema matchings; semantic type; statistical types; Aggregates; Books; Catalogs; Cities and towns; Databases; Humans; Measurement standards;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-1836-7
  • Electronic_ISBN
    978-1-4244-1837-4
  • Type

    conf

  • DOI
    10.1109/ICDE.2008.4497420
  • Filename
    4497420