• DocumentCode
    1504476
  • Title
    Empirical Investigations into Full-Text Protein Interaction Article Categorization Task (ACT) in the BioCreative II.5 Challenge
  • Author
    Lan, Man ; Su, Jian
  • Author_Institution
    Inst. for Infocomm Res., Singapore, Singapore
  • Volume
    7
  • Issue
    3
  • fYear
    2010
  • Firstpage
    421
  • Lastpage
    427
  • Abstract
    The selection of protein interaction documents is an important application in biology research and has a direct impact on the quality of downstream BioNLP applications, e.g., information extraction and retrieval, summarization, and QA. The BioCreative II.5 Challenge Article Categorization Task (ACT) involves binary text classification to determine whether a given structured full-text article contains protein interaction information. This may be the first attempt at classification of full-text protein interaction documents in the wider community. In this paper, we compare and evaluate the effectiveness of different section types in full-text articles for text classification. Moreover, in practice, the small number of true-positive samples results in unstable performance and an unreliable classifier trained on them. Previous research on learning with skewed class distributions has altered the class distribution using oversampling and downsampling. We also investigate skewed protein interaction classification and analyze the effect of various issues related to the choice of external sources, oversampling of training sets, classifiers, etc. We report on the factors above to show that 1) a full-text biomedical article contains a wealth of scientific information important to users that may not be completely represented by abstracts and/or keywords, and using it improves classification accuracy, and 2) reinforcing true-positive samples significantly increases the accuracy and stability of classification.
  • Keywords
    bioinformatics; information retrieval; proteins; scientific information systems; text analysis; BioCreative II.5 challenge; article categorization task; binary text classification; biology research; downstream BioNLP applications; full text biomedical article; full text protein interaction; information extraction; scientific information; BioCreative; Protein interaction; full-text article; text classification; Abstracting and Indexing as Topic; Automatic Data Processing; Computational Biology; Natural Language Processing; Pattern Recognition, Automated; Periodicals as Topic; Protein Interaction Mapping
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2010.49
  • Filename
    5473208