DocumentCode
57380
Title
Improving Integration Effectiveness of ID Mapping Based Biological Record Linkage
Author
Jamil, Hasan M.
Author_Institution
Dept. of Comput. Sci., Univ. of Idaho, Moscow, ID, USA
Volume
12
Issue
2
fYear
2015
fDate
March-April 2015
Firstpage
473
Lastpage
486
Abstract
Traditionally, biological objects such as genes, proteins, and pathways are represented by a convenient identifier, or ID, which is then used to cross-reference, link, and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor-quality data, inadvertently leading to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc and restriction-free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and data integration based on ID mapping on-the-fly.
Keywords
bioinformatics; computational complexity; data integration; genetics; integration; proteins; BioMart; GeneCards; ID mapping based biological record linkage; ID mapping on-the-fly; PIR; UniGene; ad hoc identifiers; authoritative databases; biological databases; biological objects; computationally complex ID mapping systems; convenient identifier; data integration; data quality; declarative statements; distributed computational pipelines; genes; integration effectiveness; knowledge fusion; nontrivial complex ID mapping systems; online resources; proteins; restriction free ID mapping; Bioinformatics; Computational biology; Data integration; Databases; Genomics; Maintenance engineering; Proteins; ID mapping; computational pipeline; data fusion; declarative query language; on-the-fly data integration; workflow
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2014.2355213
Filename
6892944