DocumentCode :
3419203
Title :
Text categorization of Enron email corpus based on information bottleneck and maximal entropy
Author :
Wang, Man ; He, Yifan ; Jiang, Minghu
Author_Institution :
Sch. of Humanities & Social Sci., Tsinghua Univ., Beijing, China
fYear :
2010
fDate :
24-28 Oct. 2010
Firstpage :
2472
Lastpage :
2475
Abstract :
This paper addresses text categorization of the Enron email corpus. We use the information bottleneck (IB) method to cluster keywords based on their distribution over class labels, add threads and address groups as additional features of the email texts, and apply the maximal entropy model to improve the accuracy of the classifier. Our experimental results show that these measures improve the classifier's performance, because keywords change too rapidly in emails while address groups are much steadier.
Keywords :
classification; electronic mail; entropy; pattern clustering; text analysis; Enron email corpus; classifier performance; email text; information bottleneck; key word clustering; maximal entropy; text categorization; Electronic mail; Entropy; Feature extraction; Text categorization; Training; data mining; email corpus
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Signal Processing (ICSP), 2010 IEEE 10th International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-5897-4
Type :
conf
DOI :
10.1109/ICOSP.2010.5656737
Filename :
5656737