• DocumentCode
    3520160
  • Title

    Document Classification for Mining Host Pathogen Protein-Protein Interactions

  • Author

    Xu, Guixian ; Yin, Lanlan ; Torii, Manabu ; Niu, Zhendong ; Wu, Cathy ; Hu, Zhangzhi ; Liu, Hongfang

  • Author_Institution
    DBBB, Georgetown Univ. Med. Center, Washington, DC
  • fYear
    2008
  • fDate
    3-5 Nov. 2008
  • Firstpage
    461
  • Lastpage
    466
  • Abstract
    Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host-pathogen protein-protein interactions, from the literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels: abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed that abstract-level and sentence-level systems yielded different classifications of MEDLINE abstracts, and that combining these systems could improve overall document classification.
  • Keywords
    biohazards; data mining; database management systems; diseases; document handling; feature extraction; medical information systems; microorganisms; molecular biophysics; pattern classification; proteins; support vector machines; terrorism; MEDLINE; SVM; abstract-level systems; bioterrorism; document classification; feature selection; human pathogens; infectious diseases; information mining; pathogenesis-related protein; protein-protein interactions; sentence-level systems; support vector machines; Abstracts; Bioterrorism; Databases; Diseases; Floods; Humans; Pathogens; Protein engineering; Support vector machine classification; Support vector machines; Document classification; Host Pathogen Protein Protein Interaction; Text Mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine, 2008. BIBM '08. IEEE International Conference on
  • Conference_Location
    Philadelphia, PA
  • Print_ISBN
    978-0-7695-3452-7
  • Type

    conf

  • DOI
    10.1109/BIBM.2008.66
  • Filename
    4684940