Title :
Analyzing the Effect of Document Representation on Machine Learning Approaches in Multi-Class e-Mail Filtering
Author :
Berger, Helmut ; Dittenbach, Michael ; Merkl, Dieter
Author_Institution :
iSpaces Res. Group, E-Commerce Competence Center, Wien
Abstract :
This paper reports on experiments in multi-class document categorization with supervised machine learning techniques. The document collection consists of a set of personal e-mail messages. Two distinct document representation formalisms are employed to characterize these messages, namely a standard word-based approach and a character n-gram document representation. Based on these document representations, the categorization performance of five machine learning approaches is assessed and compared. In principle, both document representations yielded comparable results with the various classifiers. However, the results for the n-gram-based document representation were definitely better in the case of an aggressive feature selection strategy.
Keywords :
electronic mail; learning (artificial intelligence); text analysis; aggressive feature selection strategy; character n-gram document representation; document representation formalism; multiclass document categorization; multiclass e-mail filtering; personal e-mail message; supervised machine learning approach; word-based approach; Automation; Cities and towns; Classification algorithms; Electronic mail; Filtering; Filters; Machine learning; Research and development; Sorting; Text categorization;
Conference_Title :
Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2747-7