• DocumentCode
    79783
  • Title
    RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information
  • Author
    Torii, Manabu ; Arighi, Cecilia N. ; Li, Gang ; Wang, Qinghua ; Wu, Cathy H. ; Vijay-Shanker, K.
  • Author_Institution
    Med. Inf. Group, Kaiser Permanente Southern California, San Diego, CA, USA
  • Volume
    12
  • Issue
    1
  • fYear
    2015
  • fDate
    Jan.-Feb. 2015
  • Firstpage
    17
  • Lastpage
    29
  • Abstract
    We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system has integrated several new features, including the capability of processing full-text articles and generalizability towards different post-translational modifications (PTMs). To evaluate the system, sets of abstracts and full-text articles, containing a variety of textual expressions, were annotated. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. It was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, and achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on the corpus. Full-scale processing of all abstracts in MEDLINE and all articles in the PubMed Central Open Access Subset has demonstrated scalability for mining rich information in literature, enabling its adoption for biocuration and for knowledge discovery. The new system is generalizable and will be adapted to tackle other major PTM types. The RLIMS-P 2.0 system is available online (http://proteininformationresource.org/rlimsp/) and the developed corpora are available from iProLINK (http://proteininformationresource.org/iprolink/).
  • Keywords
    biochemistry; bioinformatics; data mining; enzymes; feature extraction; molecular biophysics; molecular configurations; natural language processing; 2013 BioNLP-ST GE task; MEDLINE; PubMed Central Open Access Subset; RLIMS-P 2.0 online system; RLIMS-P version 2.0; abstract corpus; biocuration; enhanced rule-based information extraction system; full-text article processing; generalizable rule-based information extraction system; iProLINK; literature mining kinase; phosphorylation core task; post-translational modifications; protein phosphorylation information; rich-information mining; substrate; system achieved F-scores; textual expressions; Abstracts; Computational biology; Pipelines; Proteins; Substrates; Syntactics; Biology and genetics; Context Analysis and Indexing; Text mining
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2014.2372765
  • Filename
    6977948