• DocumentCode
    2771360
  • Title

    Assigning Schema Labels Using Ontology And Heuristics

  • Author

    Zhang, Xuan ; Jin, Ruoming ; Agrawal, Gagan

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
  • fYear
    2006
  • fDate
    16-18 Oct. 2006
  • Firstpage
    269
  • Lastpage
    280
  • Abstract
    Bioinformatics data is growing at a phenomenal rate. Besides the exponential growth of individual databases, the number of data depositories is increasing too. Because of the complexity of biological concepts, bioinformatics data usually has complex data structures and cannot be easily captured with the relational model. As a result, various flat-file formats have been used. Although easy for humans to interpret, flat-file formats lack standards and are hard to recognize automatically. Consequently, manually written parsers are widely used to extract data from them. This has limited the readiness of the data for data-consuming programs, such as integration systems. This paper presents a data-mining-based approach for automatically assigning schema labels to the attributes in a flat-file biological dataset. In conjunction with our prior work on semi-automatically identifying the delimiters and automatically generating parsers, automatic schema labeling offers a novel and practical solution for integrating biological datasets on-the-fly. Our approach for schema labeling is based on unsupervised learning, and represents each attribute by a feature vector built from its most frequently occurring data values. We combine the use of a biological ontology with heuristics. We are able to deal with noise in the datasets by using cutoff functions. Detailed experimental results from three datasets demonstrate the effectiveness of the use of data mining for biological applications.
  • Keywords
    biology computing; data mining; grammars; heuristic programming; ontologies (artificial intelligence); unsupervised learning; automatically assigning schema labels; automatically generating parsers; bioinformatics data; biological ontology; data consuming programs; data depositories; data extraction; data mining based approach; data structures; delimiters; feature representation; flat-file biological dataset; heuristics; integration systems; manually written parsers; unsupervised learning; Bioinformatics; Clustering algorithms; Computer science; Data analysis; Data engineering; Data mining; Humans; Labeling; Ontologies; Unsupervised learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on
  • Conference_Location
    Arlington, VA
  • Print_ISBN
    0-7695-2727-2
  • Type

    conf

  • DOI
    10.1109/BIBE.2006.253344
  • Filename
    4019669