DocumentCode :
2771360
Title :
Assigning Schema Labels Using Ontology And Heuristics
Author :
Zhang, Xuan ; Jin, Ruoming ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
fYear :
2006
fDate :
16-18 Oct. 2006
Firstpage :
269
Lastpage :
280
Abstract :
Bioinformatics data is growing at a phenomenal rate. Besides the exponential growth of individual databases, the number of data depositories is increasing too. Because of the complexity of biological concepts, bioinformatics data usually has complex data structures and cannot be easily captured with the relational model. As a result, various flat-file formats have been used. Although easy for humans to interpret, flat-file formats lack standards and are hard to recognize automatically, so manually written parsers are widely used to extract data from them. This has limited the readiness of the data for data-consuming programs, such as integration systems. This paper presents a data-mining-based approach for automatically assigning schema labels to the attributes in a flat-file biological dataset. In conjunction with our prior work on semi-automatically identifying delimiters and automatically generating parsers, automatic schema labeling offers a novel and practical solution for integrating biological datasets on the fly. Our approach to schema labeling is based on unsupervised learning and represents each attribute by its most frequently occurring data values. We combine the use of a biological ontology with heuristics. We are able to deal with noise in the datasets by using cutoff functions. Detailed experimental results from three datasets demonstrate the effectiveness of data mining for biological applications.
Keywords :
biology computing; data mining; grammars; heuristic programming; ontologies (artificial intelligence); unsupervised learning; automatically assigning schema labels; automatically generating parsers; bioinformatics data; biological ontology; data consuming programs; data depositories; data extraction; data mining based approach; data structures; delimiters; feature representation; flat-file biological dataset; heuristics; integration systems; manually written parsers; unsupervised learning; Bioinformatics; Clustering algorithms; Computer science; Data analysis; Data engineering; Data mining; Humans; Labeling; Ontologies; Unsupervised learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on
Conference_Location :
Arlington, VA
Print_ISBN :
0-7695-2727-2
Type :
conf
DOI :
10.1109/BIBE.2006.253344
Filename :
4019669