DocumentCode :
2338751
Title :
A comparative study of topic identification on newspaper and e-mail
Author :
Bigi, Brigitte ; Brun, Armelle ; Haton, Jean-Paul ; Smaïli, Kamel ; Zitouni, Imed
Author_Institution :
LORIA/INRIA-Lorraine
fYear :
2001
fDate :
13-15 Nov. 2001
Firstpage :
238
Lastpage :
241
Abstract :
This work presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification rate of 80% on a general newspaper corpus but does not exceed 30% on the e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on the newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.
Keywords :
Electronic mail; Information retrieval; Natural languages; Routing; Speech recognition; Statistical analysis; Testing; Text categorization; Vocabulary;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
String Processing and Information Retrieval, 2001. SPIRE 2001. Proceedings. Eighth International Symposium on
Conference_Location :
Laguna de San Rafael, Chile
Print_ISBN :
0-7695-1192-9
Type :
conf
DOI :
10.1109/SPIRE.2001.989770
Filename :
989770