Automatic Training Data Cleaning for Text Classification

Author

Malik, Hassan H. ; Bhardwaj, Vikas S.

Author_Institution

Thomson Reuters, New York, NY, USA

fYear

2011

fDate

11 Dec. 2011

Firstpage

442

Lastpage

449

Abstract

Supervised text classification algorithms rely on the availability of large quantities of quality training data to achieve their optimal performance. However, not all training data is created equal, and the quality of class labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro-F1 and Macro-F1 scores achieved by linear SVMs by as much as 14.5% and 30% respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time-consuming process, and the classification quality may not improve linearly with the amount of time spent, making it increasingly expensive to achieve higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class labels to automatically validate and correct labels of noisy training data. A subset of these instances is then selected to augment the original training set. On a large noisy dataset with about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13% respectively, without requiring any further human intervention.
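The abstract reports gains in both Micro-F1 and Macro-F1, which can move by different amounts because they weight classes differently. The following is a minimal sketch of the two standard definitions, not code from the paper: it assumes single-label classification for simplicity (the paper's news corpus may well be multi-label), and the function names and the toy labels `sports` and `markets` are hypothetical.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Micro-F1 pools the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy example (hypothetical labels): a classifier that ignores the rare class.
y_true = ["sports"] * 8 + ["markets"] * 2
y_pred = ["sports"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case Micro-F1 is 0.800 while Macro-F1 is only 0.444, because the missed rare class contributes a per-class F1 of zero. This illustrates why label corrections that mainly help infrequent classes can raise Macro-F1 more than Micro-F1.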

Keywords

learning (artificial intelligence); pattern classification; support vector machines; text analysis; Macro-F₁ scores; Micro-F₁ scores; automatic training data cleaning; document; domain knowledge; evidence based training data cleaning method; expert journalists; linear SVM; manual label correction; supervised text classification algorithms; Cleaning; Clustering algorithms; Humans; Manuals; Noise measurement; Training; Training data; Text classification; training data cleaning

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on

Conference_Location

Vancouver, BC

Print_ISBN

978-1-4673-0005-6

Type

conf

DOI

10.1109/ICDMW.2011.36

Filename

6137413