Title :
Automatic Training Data Cleaning for Text Classification
Author :
Malik, Hassan H. ; Bhardwaj, Vikas S.
Author_Institution :
Thomson Reuters, New York, NY, USA
Abstract :
Supervised text classification algorithms rely on the availability of large quantities of quality training data to achieve their optimal performance. However, not all training data is created equal, and the quality of class labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro-F1 and Macro-F1 scores achieved by Linear SVMs by as much as 14.5% and 30% respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time-consuming process, and the classification quality may not improve linearly with the amount of time spent, making it increasingly expensive to achieve higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class labels to automatically validate and correct labels of noisy training data. A subset of these instances is then selected to augment the original training set. On a large noisy dataset with about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13% respectively, without requiring any further human intervention.
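The abstract reports both Micro-F1 and Macro-F1, which can move by different amounts because Micro-F1 pools counts over all classes while Macro-F1 averages per-class scores. The following is a minimal sketch of the two standard metrics, not code from the paper; the function name `micro_macro_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def micro_macro_f1(gold, predicted):
    """Return (micro_f1, macro_f1) for parallel lists of label sets.

    Hypothetical helper for illustration: each element of `gold` and
    `predicted` is the set of class labels for one document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        for label in p & g:   # predicted and correct
            tp[label] += 1
        for label in p - g:   # predicted but wrong
            fp[label] += 1
        for label in g - p:   # missed
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro-F1: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy example: the rare class "b" is always missed.
gold = [{"a"}, {"a"}, {"a"}, {"b"}]
pred = [{"a"}, {"a"}, {"a"}, {"a"}]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case Micro-F1 is 0.75 while Macro-F1 is about 0.43, since the missed rare class counts as much as the frequent one in the macro average. This is one reason label corrections that mainly help rare classes can raise Macro-F1 more than Micro-F1.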
Keywords :
learning (artificial intelligence); pattern classification; support vector machines; text analysis; Macro-F1 scores; Micro-F1 scores; automatic training data cleaning; document; domain knowledge; evidence based training data cleaning method; expert journalists; linear SVM; manual label correction; supervised text classification algorithms; Cleaning; Clustering algorithms; Humans; Manuals; Noise measurement; Training; Training data; Text classification; training data cleaning;
Conference_Title :
Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
Conference_Location :
Vancouver, BC
Print_ISBN :
978-1-4673-0005-6
DOI :
10.1109/ICDMW.2011.36