• DocumentCode
    2507818
  • Title

    Text joins for data cleansing and integration in an RDBMS

  • Author

    Gravano, Luis ; Ipeirotis, Panagiotis G. ; Koudas, Nick ; Srivastava, Divesh

  • Author_Institution
    Columbia Univ., NY, USA
  • fYear
    2003
  • fDate
    5-8 March 2003
  • Firstpage
    729
  • Lastpage
    731
  • Abstract
    An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
  • Keywords
    data integrity; query processing; relational databases; string matching; cosine similarity metric; data cleaning system; data integration; information retrieval; relational DBMS; sampling-based text join; textual attribute matching; Algorithm design and analysis; Cleaning; Databases; Information retrieval; Robustness; Scalability; Standards organizations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2003. Proceedings. 19th International Conference on
  • Print_ISBN
    0-7803-7665-X
  • Type

    conf

  • DOI
    10.1109/ICDE.2003.1260850
  • Filename
    1260850
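
  • Illustration
    The abstract describes matching textual attributes with the cosine similarity metric. The sketch below illustrates that idea only: it computes an exact, in-memory text join over two small lists of strings using TF-IDF weights on whitespace-separated words. The tokenization, the weighting formula, the function names and the sample strings are all assumptions made for this illustration; it is not the approximate, sampling-based in-RDBMS strategy that the paper proposes.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def tfidf_vectors(strings):
    """Return one unit-length TF-IDF weight dict per input string."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings in which each token appears.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        # Illustrative weighting: tf * log(1 + n / df), always positive.
        weights = {t: tf * math.log(1 + n / df[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def text_join(left, right, threshold):
    """Exact text join: all (left, right) pairs with similarity >= threshold."""
    vecs = tfidf_vectors(left + right)
    lvecs, rvecs = vecs[:len(left)], vecs[len(left):]
    result = []
    for a, x in zip(left, lvecs):
        for b, y in zip(right, rvecs):
            if cosine(x, y) >= threshold:
                result.append((a, b))
    return result


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the paper.
    left = ["AT&T Labs Research", "Columbia University"]
    right = ["AT&T Labs", "Columbia Univ."]
    for pair in text_join(left, right, threshold=0.5):
        print(pair)
```

    The nested loop compares every pair of strings, which is the expensive exact computation that the abstract says motivates an approximate strategy.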