• DocumentCode
    2338751
  • Title

    A comparative study of topic identification on newspaper and e-mail

  • Author

    Bigi, Brigitte ; Brun, Armelle ; Haton, Jean-Paul ; Smaïli, Kamel ; Zitouni, Imed

  • Author_Institution
    LORIA/INRIA-Lorraine
  • fYear
    2001
  • fDate
    13-15 Nov. 2001
  • Firstpage
    238
  • Lastpage
    241
  • Abstract
    This work presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification rate of 80% on a general newspaper corpus but does not exceed 30% on the e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on the newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
  • Keywords
    Electronic mail; Information retrieval; Natural languages; Routing; Speech recognition; Statistical analysis; Testing; Text categorization; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    String Processing and Information Retrieval, 2001. SPIRE 2001. Proceedings. Eighth International Symposium on
  • Conference_Location
    Laguna de San Rafael, Chile
  • Print_ISBN
    0-7695-1192-9
  • Type

    conf

  • DOI
    10.1109/SPIRE.2001.989770
  • Filename
    989770