Regional Information Center for Science and Technology - An HMM-based over-sampling technique to improve text classification

Title of article :

An HMM-based over-sampling technique to improve text classification

Author/Authors :

Iglesias, E.L. and Seara Vieira, A. and Borrajo, L.

Issue Information :

Periodical with serial number, year 2013

Pages :

From page :

7184

To page :

7192

Abstract :

This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), includes an HMM that is trained with a corpus in order to create new samples according to current documents. The HMM is treated as a document generator which can produce synthetic instances based on what it was trained with. To demonstrate its achievement, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics), and is then compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that the application of over-sampling strategies increases the global performance of the SVM in classifying documents. Based on the empirical and statistical studies, the new method clearly outperforms the baseline method (ROS), and offers greater performance than SMOTE in the majority of tested cases.
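
The idea of using an HMM as a document generator for over-sampling can be illustrated with a toy sketch. This is not the authors' COS-HMM: the record does not describe how the model is structured or trained, so the states, probabilities, vocabulary, and the function names `sample_document` and `oversample` below are all hypothetical, and the parameters are set by hand instead of being estimated from a corpus. The sketch only shows the general pattern: draw word sequences from an HMM and add them to the minority class until it matches the majority class in size.

```python
import random

def sample_document(start_p, trans_p, emit_p, length, rng):
    """Draw `length` words from an HMM given as nested probability dicts."""
    states = list(start_p)
    state = rng.choices(states, weights=[start_p[s] for s in states])[0]
    words = []
    for _ in range(length):
        # Emit one word from the current hidden state.
        vocab = list(emit_p[state])
        words.append(rng.choices(vocab, weights=[emit_p[state][w] for w in vocab])[0])
        # Move to the next hidden state.
        state = rng.choices(states, weights=[trans_p[state][s] for s in states])[0]
    return words

def oversample(minority_docs, n_majority, start_p, trans_p, emit_p, rng):
    """Add generated documents until the minority class has n_majority items."""
    balanced = list(minority_docs)
    while len(balanced) < n_majority:
        # Reuse the length of a randomly chosen real minority document.
        length = len(rng.choice(minority_docs))
        balanced.append(sample_document(start_p, trans_p, emit_p, length, rng))
    return balanced

# Hypothetical hand-set parameters (assumption); a real system would
# estimate them from the minority-class documents.
START = {"topic": 0.7, "filler": 0.3}
TRANS = {
    "topic": {"topic": 0.6, "filler": 0.4},
    "filler": {"topic": 0.5, "filler": 0.5},
}
EMIT = {
    "topic": {"gene": 0.5, "protein": 0.3, "therapy": 0.2},
    "filler": {"the": 0.6, "of": 0.4},
}

rng = random.Random(0)
minority = [["gene", "of", "protein"], ["the", "therapy"]]
balanced = oversample(minority, 5, START, TRANS, EMIT, rng)
```

Here `balanced` keeps the two original minority documents and adds three generated ones. In contrast, Random Over-Sampling would simply duplicate existing minority documents, which adds no new word sequences.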

Keywords :

Hidden Markov model , Text classification , Oversampling techniques

Journal title :

Expert Systems with Applications

Serial Year :

2013

Record number :

2354087

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2354087