  • DocumentCode
    57380
  • Title
    Improving Integration Effectiveness of ID Mapping Based Biological Record Linkage
  • Author
    Jamil, Hasan M.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Idaho, Moscow, ID, USA
  • Volume
    12
  • Issue
    2
  • fYear
    2015
  • fDate
    March-April 2015
  • Firstpage
    473
  • Lastpage
    486
  • Abstract
    Traditionally, biological objects such as genes, proteins, and pathways are represented by a convenient identifier, or ID, which is then used to cross-reference, link, and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR, and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor-quality data, inadvertently leading to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc and restriction-free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and data integration based on ID mapping on-the-fly.
  • Keywords
    bioinformatics; computational complexity; data integration; genetics; integration; proteins; BioMart; GeneCards; ID mapping based biological record linkage; ID mapping on-the-fly; PIR; UniGene; ad hoc identifiers; authoritative databases; biological databases; biological objects; computationally complex ID mapping systems; convenient identifier; data quality; declarative statements; distributed computational pipelines; genes; integration effectiveness; knowledge fusion; nontrivial complex ID mapping systems; online resources; restriction free ID mapping; Bioinformatics; Computational biology; Data integration; Databases; Genomics; Maintenance engineering; Proteins; ID mapping; computational pipeline; data fusion; declarative query language; on-the-fly data integration; workflow
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2014.2355213
  • Filename
    6892944