Title : 
Identifying candidate genes using the BioWarehouse: a case study
         
        
            Author : 
Pouliot, Yannick ; Lee, Thomas J. ; Wagner, Valerie ; Karp, Peter D.
         
        
            Author_Institution : 
Bioinformatics Res. Group, Menlo Park, CA, USA
         
        
        
        
        
        
            Abstract : 
The BioWarehouse is an open source data warehousing environment for bioinformatics databases (DBs). Running on the MySQL or Oracle relational database management systems (RDBMSs), BioWarehouse integrates public source DBs such as Swiss-Prot and GenBank into a unified, normalized schema under a single DB management system. BioWarehouse also imposes partial semantic normalization on the source data, which decreases semantic heterogeneity and facilitates multi-DB queries in the Structured Query Language (SQL). As a case study of the BioWarehouse, we have identified candidate genes for "orphan" activities, defined as activities for which no cognate gene sequences exist. Of the enzymatic activities that have been assigned an Enzyme Commission (EC) number, 1,356 (36%) are orphans (Karp, 2004). Such high prevalence is problematic, given that many of these activities have been known for decades and often perform essential functions. Most notably, the existence of orphans introduces gaps in sequence data that significantly limit the accuracy of genome annotation and metabolic pathway prediction. Fortunately, with more than 200 genomes sequenced to completion, and with the availability of systems such as BioWarehouse, the computational identification of candidate genes associated with orphan activities can be envisioned. The BioWarehouse's conglomeration of databases, combined with Oracle 10g's native integration of analytical tools into SQL queries (such as the Basic Local Alignment Search Tool (BLAST) and POSIX regular expressions), enabled us to identify a small number of high-confidence candidate genes associated with a specific orphan activity. We describe the complex queries used in this work to illustrate the value of the data warehousing approach to bioinformatics research.
         
        
            Keywords : 
SQL; biology computing; data warehouses; genetics; public domain software; query processing; relational databases; scientific information systems; C language; Java language; SQL query; Structured Query Language; analytical tools integration; bioinformatics database support; cognate gene sequence; computational candidate gene identification; database conglomeration; database integration; database management system; database multiple version; database parsing; enzymatic activity; enzyme commission number; genome annotation; genome sequencing; metabolic pathway prediction; multidatabase query; open source data warehousing environment; partial semantic normalization; public source database; semantic heterogeneity; simultaneous storage; Biochemistry; Bioinformatics; Computer aided software engineering; Database languages; Genomics; Java; Organisms; Relational databases; Uniform resource locators; Warehousing;
         
        
        
        
            Conference_Title : 
Systems Engineering, 2005. ICSEng 2005. 18th International Conference on
         
        
            Print_ISBN : 
0-7695-2359-5
         
        
        
            DOI : 
10.1109/ICSENG.2005.47