• DocumentCode
    477953
  • Title

    Detection of Protein Subcellular Localization Based on a Full Syntactic Parser and Semantic Information

  • Author

    Kim, Mi-Young

  • Author_Institution
    Sch. of Comput. Sci. & Eng., Sungshin Women's Univ., Seoul
  • Volume
    4
  • fYear
    2008
  • fDate
    18-20 Oct. 2008
  • Firstpage
    407
  • Lastpage
    411
  • Abstract
    A protein's subcellular localization is considered an essential part of the description of its associated biomolecular phenomena. As the volume of biomolecular reports has increased, there has been a great deal of research on text mining to detect protein subcellular localization information in documents. It has been argued that linguistic information, especially syntactic information, is useful for identifying the subcellular localizations of proteins of interest. However, previous systems for detecting protein subcellular localization information used only shallow syntactic parsers, and showed poor performance. Thus, there remains a need to use a full syntactic parser and to apply deep linguistic knowledge to the analysis of text for protein subcellular localization information. In addition, we have attempted to use semantic information from the WordNet thesaurus. To improve performance in detecting protein subcellular localization information, this paper proposes a three-step method based on a full syntactic dependency parser and semantic information. In the first step, we construct syntactic dependency paths from each protein to its location candidate. In the second step, we retrieve root information of the syntactic dependency paths. In the final step, we extract syn-semantic patterns of protein subtrees and location subtrees. From the root and subtree nodes, we extract syntactic category and syntactic direction as syntactic information, and synset offset of the WordNet thesaurus as semantic information. According to the root information and syn-semantic patterns of subtrees, we extract (protein, localization) pairs. Even with no biomolecular knowledge, our method shows reasonable performance in experimental results using Medline abstract data. In fact, our proposed method gave an F-measure of 74.53% for training data and 58.90% for test data, significantly outperforming previous methods, by 12-25%.
  • Keywords
    biology computing; data mining; grammars; proteins; WordNet thesaurus; protein subcellular localization; semantic information; syntactic dependency path; syntactic parser; text mining; Bioinformatics; Data mining; Fuzzy systems; Information analysis; Information retrieval; Learning systems; Protein engineering; Testing; Text mining; Thesauri
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery, 2008. FSKD '08. Fifth International Conference on
  • Conference_Location
    Jinan, Shandong
  • Print_ISBN
    978-0-7695-3305-6
  • Type

    conf

  • DOI
    10.1109/FSKD.2008.529
  • Filename
    4666419