Abstract:
The selection of protein interaction documents is an important application in biology research and directly affects the quality of downstream BioNLP applications such as information extraction and retrieval, summarization, and question answering (QA). The BioCreative II.5 Challenge Article Categorization Task (ACT) is a binary text classification task: determine whether a given structured full-text article contains protein interaction information. This may be the first community-wide attempt at classifying full-text protein interaction documents. In this paper, we compare and evaluate the effectiveness of different section types in full-text articles for text classification. In practice, the small number of true-positive samples leads to unstable performance and to unreliable classifiers trained on them. Previous research on learning with skewed class distributions has altered the class distribution using oversampling and downsampling. We therefore also investigate protein interaction classification under a skewed class distribution and analyze the effect of several factors, including the choice of external sources, oversampling of the training set, and the choice of classifier. Our results on these factors show that 1) a full-text biomedical article contains a wealth of scientific information important to users that may not be completely represented by abstracts and/or keywords, and using it improves classification accuracy, and 2) reinforcing true-positive samples significantly increases the accuracy and stability of classification.
Keywords:
bioinformatics; information retrieval; proteins; scientific information systems; text analysis; BioCreative II.5 challenge; article categorization task; binary text classification; biology research; downstream BioNLP applications; full text biomedical article; full text protein interaction; information extraction; scientific information; BioCreative; Protein interaction; full-text article; text classification; Abstracting and Indexing as Topic; Automatic Data Processing; Computational Biology; Natural Language Processing; Pattern Recognition, Automated; Periodicals as Topic; Protein Interaction Mapping
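
The abstract refers to oversampling the training set to reinforce true-positive samples. As a minimal sketch of what random oversampling of a minority class can look like, the following Python snippet duplicates minority-class samples until both classes are equally represented. The function name `oversample_minority`, the `seed` parameter, and the choice to balance the classes exactly are assumptions made for illustration; the abstract does not specify the oversampling procedure actually used.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until both classes are equal in size.

    Hypothetical helper for illustration only; not the procedure from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")

    # most_common returns (label, count) pairs sorted by descending count.
    (_, n_major), (minority, n_minor) = counts.most_common(2)

    minority_items = [s for s, y in zip(samples, labels) if y == minority]

    # Draw with replacement from the minority class to fill the gap.
    extra = [rng.choice(minority_items) for _ in range(n_major - n_minor)]

    new_samples = list(samples) + extra
    new_labels = list(labels) + [minority] * len(extra)
    return new_samples, new_labels


# Toy example: four negative documents (0) and one positive document (1).
docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
labels = [0, 0, 0, 0, 1]
balanced_docs, balanced_labels = oversample_minority(docs, labels)
```

Oversampling of this kind would normally be applied to the training split only, so that evaluation data keeps its original class distribution.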