• DocumentCode
    2942056
  • Title
    A knowledge engineering approach to recognizing and extracting sequences of nucleic acids from scientific literature
  • Author
    García-Remesal, Miguel ; Maojo, Víctor ; Crespo, José
  • Author_Institution
    Dept. Intel. Artificial, Univ. Politec. de Madrid, Boadilla del Monte, Spain
  • fYear
    2010
  • fDate
    Aug. 31 2010-Sept. 4 2010
  • Firstpage
    1081
  • Lastpage
    1084
  • Abstract
    In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. These candidates are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. On this set, we achieved 87.76% precision and 97.70% recall. This method can facilitate research tasks such as text mining, information extraction, and information retrieval over large collections of documents containing genetic sequences.
  • Keywords
    DNA; bioinformatics; cellular biophysics; data mining; feature extraction; finite state machines; genetics; information retrieval; knowledge engineering; molecular biophysics; DNA sequence; RNA sequence; finite state machine; genetic sequences; information extraction; information retrieval research; knowledge engineering; manuscripts; nucleic acid sequence extraction; nucleic acid sequence recognition; preliminary recognizer; scientific articles; scientific literature; text mining; DNA; Detectors; Knowledge based systems; Noise measurement; Software; Text recognition; Algorithms; Artificial Intelligence; Base Sequence; DNA; Data Mining; Molecular Sequence Data; Natural Language Processing; Pattern Recognition, Automated; Periodicals as Topic; Sequence Analysis, DNA;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE
  • Conference_Location
    Buenos Aires
  • ISSN
    1557-170X
  • Print_ISBN
    978-1-4244-4123-5
  • Type
    conf
  • DOI
    10.1109/IEMBS.2010.5627316
  • Filename
    5627316
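
The two-stage pipeline described in the abstract (a finite state machine that collects candidate sequences, followed by a knowledge-based filter) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the uppercase-only alphabet, the separator set, the minimum length of 10, and the two filter rules are hypothetical stand-ins for the authors' recognizer and knowledge base.

```python
from typing import List

NUCLEOTIDES = set("ACGTU")   # assumed alphabet: uppercase DNA/RNA letters only
SEPARATORS = set(" -")       # assumed: gaps tolerated inside a printed sequence


def find_candidates(text: str, min_len: int = 10) -> List[str]:
    """Two-state scanner (OUTSIDE / INSIDE a run) that collects candidates."""
    candidates: List[str] = []
    buffer: List[str] = []
    inside = False
    # The trailing newline acts as a sentinel that flushes the last run.
    for ch in text + "\n":
        if ch in NUCLEOTIDES:
            inside = True
            buffer.append(ch)
        elif inside and ch in SEPARATORS:
            continue  # stay INSIDE, but do not store the separator
        else:
            if inside and len(buffer) >= min_len:
                candidates.append("".join(buffer))
            buffer = []
            inside = False
    return candidates


def is_plausible(seq: str) -> bool:
    """Hypothetical filter rules standing in for the knowledge base."""
    if "T" in seq and "U" in seq:
        return False  # assumed rule: mixing T and U suggests a false positive
    if len(set(seq)) < 2:
        return False  # assumed rule: a single repeated letter is discarded
    return True


def extract_sequences(text: str, min_len: int = 10) -> List[str]:
    """Run the scanner, then keep only candidates that pass the filter."""
    return [s for s in find_candidates(text, min_len) if is_plausible(s)]


if __name__ == "__main__":
    sample = "The primer 5'-ACGTTGCA GGCTAAC-3' was used with AAAAAAAAAAAA."
    print(extract_sequences(sample))
```

In this sketch the scanner deliberately over-collects, and the filter stage is where rejected candidates are dropped. A real knowledge base would also need rules for repairing noisy or incorrectly merged sequences, which this example does not attempt.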