  • DocumentCode
    3109167
  • Title
    Duplicate Record Detection for Database Cleansing
  • Author
    Rehman, Mariam ; Esichaikul, Vatcharapon
  • Author_Institution
    Comput. Sci. & Inf. Manage. Program, Asian Inst. of Technol., Pathumthani, Thailand
  • fYear
    2009
  • fDate
    28-30 Dec. 2009
  • Firstpage
    333
  • Lastpage
    338
  • Abstract
    Many organizations collect large amounts of data to support their business and decision-making processes. Data collected from various sources may have quality problems. These issues become prominent when various databases are integrated, because the integrated databases inherit the data quality problems that were present in the source databases. The data in integrated systems need to be cleaned for proper decision making, and cleansing is one of the most crucial steps. This research focuses on one of the major issues of data cleansing, namely "duplicate record detection", which arises when data is collected from various sources. The study provides a comparison among the standard duplicate elimination algorithm (SDE), the sorted neighborhood algorithm (SNA), the duplicate elimination sorted neighborhood algorithm (DE-SNA), and the adaptive duplicate detection algorithm (ADD). A prototype is also developed, which shows that the adaptive duplicate detection algorithm is the optimal solution for the problem of duplicate record detection. For approximate matching of data records, string matching algorithms (a recursive algorithm with word base and a recursive algorithm with character base) have been implemented, and it is concluded that the results are much better with the recursive algorithm with word base.
  • Keywords
    data mining; database management systems; adaptive duplicate detection algorithm; database cleansing; duplicate elimination sorted neighborhood algorithm; duplicate record detection; recursive algorithm; standard duplicate elimination algorithm; string matching algorithm; Computer science; Customer satisfaction; Databases; Decision making; Detection algorithms; Government; Information management; Machine vision; Protection; Prototypes;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Vision, 2009. ICMV '09. Second International Conference on
  • Conference_Location
    Dubai
  • Print_ISBN
    978-0-7695-3944-7
  • Electronic_ISBN
    978-1-4244-5645-1
  • Type
    conf
  • DOI
    10.1109/ICMV.2009.43
  • Filename
    5381140