Title :
Protein named entity classification with probabilistic features derived from GENIA corpus and MEDLINE
Author :
Sumathipala, Sagara ; Yamada, Koichi ; Unehara, Muneyuki
Author_Institution :
Grad. Sch. of Eng., Nagaoka Univ. of Technol., Nagaoka, Japan
Abstract :
Biomedical named entity recognition (BNER) is an essential first step for biomedical information retrieval tasks such as discovering relations between biomedical entities and identifying molecular pathways. Although named entity recognition performs well on ordinary text, it remains challenging in the molecular biology domain because of the complexity of biomedical nomenclature, the variety of spelling forms, and other factors. Even when biomedical entities are successfully found in biological text, classifying them into the relevant biomedical classes, such as genes, proteins, diseases and drug names, is a further challenge and an open question. This paper presents a new method for classifying biomedical named entities into protein and non-protein classes. Our approach employs Random Forest, a machine learning algorithm, with a new combination of features: orthographic, keyword and morphological features, together with a probabilistic feature called Proteinhood and a Protein-Score feature based on MEDLINE abstracts indexed in PubMed. The last two features are the main contributions of the paper. A series of experiments is conducted to compare the proposed approach with other state-of-the-art approaches. In experiments on the GENIA corpus, our protein named entity classifier achieves its highest values of precision (93.8%), recall (83.8%) and F-measure (88.5%) for protein named entity identification. The study also shows the effect of the new Proteinhood and Protein-Score features and of adjusting the parameters of the Random Forest algorithm.
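As a rough illustration of two ideas in the abstract, the sketch below shows (a) what a few orthographic features for a candidate entity might look like and (b) how the reported F-measure follows from the reported precision and recall as their harmonic mean. The function names and the particular features chosen are assumptions made for illustration; the abstract does not list the paper's actual feature set, and the Proteinhood and Protein-Score features are not reproduced here because their definitions are not given.

```python
import re


def orthographic_features(entity: str) -> dict:
    """Hypothetical orthographic features for a candidate entity string.

    These are common surface-form cues in biomedical text mining, chosen
    for illustration only; they are not taken from the paper.
    """
    return {
        "has_digit": any(ch.isdigit() for ch in entity),
        "has_upper": any(ch.isupper() for ch in entity),
        "has_hyphen": "-" in entity,
        "all_caps": entity.isupper(),
        # A small, assumed list of spelled-out Greek letters.
        "has_greek_word": bool(
            re.search(r"\b(alpha|beta|gamma|kappa)\b", entity.lower())
        ),
        "token_count": len(entity.split()),
    }


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(orthographic_features("NF-kappa B"))
    # Precision and recall as reported in the abstract.
    print(round(100 * f_measure(0.938, 0.838), 1))
```

Feature dictionaries of this kind could be turned into numeric vectors and passed to any Random Forest implementation; the choice of library and parameters is left open here because the abstract does not specify them.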
Keywords :
classification; information retrieval; learning (artificial intelligence); medical computing; text analysis; BNER; GENIA corpus; MEDLINE; Pubmed; biological text; biomedical classes; biomedical information retrieval; biomedical named entity recognition; biomedical nomenclature; machine learning algorithm; molecular biology domain; molecular pathways; nonprotein classes; probabilistic features; protein named entity classification; protein-score feature; proteinhood; random forest; Biomedical measurement; Protein engineering; Proteins; Radio frequency; Silicon; Training data; Biomedical named entity; Biomedical text mining; Computational molecular biology; Named entity recognition; Protein named entity;
Conference_Title :
Soft Computing and Intelligent Systems (SCIS), 2014 Joint 7th International Conference on and Advanced Intelligent Systems (ISIS), 15th International Symposium on
DOI :
10.1109/SCIS-ISIS.2014.7044640