• DocumentCode
    3128264
  • Title
    Automatic Training Data Cleaning for Text Classification
  • Author
    Malik, Hassan H.; Bhardwaj, Vikas S.
  • Author_Institution
    Thomson Reuters, New York, NY, USA
  • fYear
    2011
  • fDate
    11 Dec. 2011
  • Firstpage
    442
  • Lastpage
    449
  • Abstract
    Supervised text classification algorithms rely on the availability of large quantities of quality training data to achieve their optimal performance. However, not all training data is created equal, and the quality of class labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro-F1 and Macro-F1 scores achieved by linear SVMs by as much as 14.5% and 30%, respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time-consuming process, and classification quality may not improve linearly with the amount of time spent, making it increasingly expensive to achieve higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class labels to automatically validate and correct labels of noisy training data. A subset of these instances is then selected to augment the original training set. On a large noisy dataset with about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13%, respectively, without requiring any further human intervention.
  • Keywords
    learning (artificial intelligence); pattern classification; support vector machines; text analysis; Macro-F1 scores; Micro-F1 scores; automatic training data cleaning; document; domain knowledge; evidence based training data cleaning method; expert journalists; linear SVM; manual label correction; supervised text classification algorithms; Cleaning; Clustering algorithms; Humans; Manuals; Noise measurement; Training; Training data; Text classification; training data cleaning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    978-1-4673-0005-6
  • Type
    conf
  • DOI
    10.1109/ICDMW.2011.36
  • Filename
    6137413