• DocumentCode
    3419203
  • Title

    Text categorization of Enron email corpus based on information bottleneck and maximal entropy

  • Author

    Wang, Man ; He, Yifan ; Jiang, Minghu

  • Author_Institution
    Sch. of Humanities & Social Sci., Tsinghua Univ., Beijing, China
  • fYear
    2010
  • fDate
    24-28 Oct. 2010
  • Firstpage
    2472
  • Lastpage
    2475
  • Abstract
    This paper addresses text categorization of the Enron email corpus. We use the information bottleneck (IB) method to cluster keywords based on their distribution over the class labels, then add threads and address groups as features alongside the email text, and apply a maximal entropy model to improve the accuracy of the classifier. Our experimental results show that these measures improve the classifier's performance, because keywords in emails change too rapidly while address groups are much steadier.
  • Keywords
    classification; electronic mail; entropy; pattern clustering; text analysis; Enron email corpus; classifier performance; email text; information bottleneck; key word clustering; maximal entropy; text categorization; Electronic mail; Entropy; Feature extraction; Text categorization; Training; data mining; email corpus
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing (ICSP), 2010 IEEE 10th International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-5897-4
  • Type

    conf

  • DOI
    10.1109/ICOSP.2010.5656737
  • Filename
    5656737