Title :
Overfitting in protein name recognition on biomedical literature and method of preventing it through use of transductive SVM
Author :
Murata, Masaki ; Mitsumori, Tomohiro ; Doi, Kouichi
Author_Institution :
National Institute of Information and Communications Technology, Kyoto
Abstract :
Machine learning methods have been used in research on protein name recognition. A classifier trained in a specific domain, however, may overfit that domain and be too inflexible to use anywhere else. We therefore developed a new corpus on breast cancer and investigated the flexibility of classifiers trained on the GENIA corpus (T. Ohta, 2002) or on the breast cancer corpus. To avoid overfitting, we used the transductive support vector machine (SVM) and evaluated the effect of transductive learning. We confirmed experimentally that the transductive SVM prevented overfitting and yielded higher accuracy than the ordinary SVM.
Keywords :
cancer; learning (artificial intelligence); medical computing; pattern classification; proteins; support vector machines; GENIA; biomedical literature; breast cancer corpus; classifier training; machine learning; protein name recognition; transductive learning; transductive support vector machine; Abstracts; Breast cancer; Data mining; Databases; Dictionaries; Hidden Markov models; Proteins; Support vector machine classification; Support vector machines; Training data;
Conference_Title :
Information Technology, 2007. ITNG '07. Fourth International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
0-7695-2776-0
DOI :
10.1109/ITNG.2007.145