DocumentCode
2338751
Title
A comparative study of topic identification on newspaper and e-mail
Author
Bigi, Brigitte ; Brun, Armelle ; Haton, Jean-Paul ; Smaïli, Kamel ; Zitouni, Imed
Author_Institution
LORIA/INRIA-Lorraine
fYear
2001
fDate
13-15 Nov. 2001
Firstpage
238
Lastpage
241
Abstract
This work presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them to very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification of 80% on a general newspaper corpus but does not exceed 30% on e-mail corpus. Another method gives the best result on e-mails, but has not the same behavior on a newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
Keywords
Electronic mail; Information retrieval; Natural languages; Routing; Speech recognition; Statistical analysis; Testing; Text categorization; Vocabulary;
fLanguage
English
Publisher
ieee
Conference_Titel
String Processing and Information Retrieval, 2001. SPIRE 2001. Proceedings. Eighth International Symposium on
Conference_Location
Laguna de San Rafael, Chile
Print_ISBN
0-7695-1192-9
Type
conf
DOI
10.1109/SPIRE.2001.989770
Filename
989770