Title :
Data management and analysis for high-throughput DNA sequencing projects
Author :
Kerlavage, A.R. ; FitzHugh, Will ; Gladek, A. ; Kelley, John ; Scott, John ; Shirley, Robert ; Sutton, Granger ; Wai-Chiu, Man ; White, Owen ; Adams, David
Author_Institution :
Dept. of Bioinf., Inst. for Genomic Res., Gaithersburg, MD, USA
Abstract :
The rapid advances in molecular biology have begun to shift many of the bottlenecks in genome research from the laboratory to the data analysis facility. The pace at which this has occurred creates a situation in which software development always has to catch up with the flow of data. Since such large-scale processes were not anticipated, the analysis infrastructure has not been fully established. Furthermore, most systems that have been built were designed by the biologists who collected the data. More recently, computer scientists, mathematicians, and engineers have taken an interest in this problem. This has had a positive effect, since it has created a tight synergy between the informatics and the biology. Several principles affected the design of the system developed at TIGR. Each of the sample preparation, sequencing, and analysis steps had to be managed, scheduled, and tracked. This information had to be made readily available to those who needed it for carrying out their tasks. Different skill levels of the users had to be taken into account. The degree of human intervention at each step had to be evaluated and built into the design. A mixed processing environment of Macintosh and Unix platforms had to be integrated. Most importantly, the system had to save time, reduce error, and ensure uniformity of the analysis and quality of the results. In the authors' experience, the tools they have built work well because of their early decisions as to which systems to use for development. The authors settled on a robust relational database management system (Sybase) and a portable development environment (C, C++).
Keywords :
DNA; biology computing; genetics; laboratory techniques; relational databases; Macintosh; Sybase; Unix; analysis infrastructure; genome research; high-throughput DNA sequencing projects; informatics; laboratory data analysis; mixed processing environment; molecular biology; portable development environment; relational database management system; sample preparation; Bioinformatics; Data analysis; Genomics; Humans; Laboratories; Large-scale systems; Programming
Journal_Title :
Engineering in Medicine and Biology Magazine, IEEE