• DocumentCode
    3106614
  • Title
    Adding Semantics to Email Clustering
  • Author
    Li, Hua ; Shen, Dou ; Zhang, Benyu ; Chen, Zheng ; Yang, Qiang
  • Author_Institution
    Microsoft Res. Asia, Beijing
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    938
  • Lastpage
    942
  • Abstract
    This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. In our algorithm, natural language processing techniques and frequent itemset mining techniques are utilized to automatically generate meaningful generalized sentence patterns (GSPs) from subjects of emails. Then we put forward a novel unsupervised approach which treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. Our proposed algorithm is expected not only to improve the clustering performance but also to provide meaningful descriptions of the resulting clusters through the GSPs. Experimental results on an open dataset (the Enron email dataset) and a personal email dataset collected by ourselves demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the popular F1 measure. Furthermore, the cluster naming readability is improved by 68.5% on the personal email dataset.
  • Keywords
    electronic mail; learning (artificial intelligence); natural language processing; pattern clustering; Enron email dataset; email clustering; generalized sentence patterns; itemset mining techniques; open dataset; Asia; Clustering algorithms; Data mining; Humans; Itemsets; Labeling; Seminars; Taxonomy; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type
    conf
  • DOI
    10.1109/ICDM.2006.16
  • Filename
    4053131