DocumentCode :
3166940
Title :
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
Author :
Wang, Xuerui ; McCallum, Andrew ; Wei, Xing
Author_Institution :
Univ. of Massachusetts, Amherst
fYear :
2007
fDate :
28-31 Oct. 2007
Firstpage :
697
Lastpage :
702
Abstract :
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
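The three sampling steps named in the abstract can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the function name `sample_document`, the dictionary layout, the fixed `bigram_prob`, and the rule that a bigram is only possible when the current topic has a distribution conditioned on the previous word are all hypothetical simplifications. The abstract does not specify priors, inference, or how the bigram status is parameterised.

```python
import random


def sample_document(length, topic_weights, bigram_prob,
                    unigram_dists, bigram_dists, rng):
    """Generate (word, topic, is_bigram) triples in textual order.

    All arguments are illustrative assumptions:
      topic_weights: {topic: weight}
      unigram_dists: {topic: {word: weight}}
      bigram_dists:  {topic: {previous_word: {word: weight}}}
    """
    words = []
    prev = None
    topics = list(topic_weights)
    for _ in range(length):
        # Step 1: sample a topic for this position.
        topic = rng.choices(
            topics, weights=[topic_weights[t] for t in topics])[0]

        # Step 2: sample the unigram/bigram status. Assumption: a bigram
        # is only possible if this topic has a distribution conditioned
        # on the previous word.
        candidates = bigram_dists.get(topic, {}).get(prev)
        is_bigram = candidates is not None and rng.random() < bigram_prob

        # Step 3: sample the word from the topic-specific bigram or
        # unigram distribution.
        dist = candidates if is_bigram else unigram_dists[topic]
        vocab = list(dist)
        word = rng.choices(vocab, weights=[dist[w] for w in vocab])[0]

        words.append((word, topic, is_bigram))
        prev = word
    return words


# Toy parameters (made up for illustration): "white house" is a bigram
# only under the 'politics' topic, never under 'real estate'.
toy_unigrams = {
    "politics": {"white": 1.0, "senate": 1.0, "vote": 1.0},
    "real estate": {"white": 1.0, "house": 1.0, "garden": 1.0},
}
toy_bigrams = {"politics": {"white": {"house": 1.0}}}
toy_doc = sample_document(
    length=8,
    topic_weights={"politics": 1.0, "real estate": 1.0},
    bigram_prob=0.9,
    unigram_dists=toy_unigrams,
    bigram_dists=toy_bigrams,
    rng=random.Random(0),
)
```

In this toy setup, a run of consecutive positions flagged as bigrams would correspond to a longer phrase, in the spirit of the abstract's remark that successive bigrams form longer phrases.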
Keywords :
data mining; information retrieval; probability; sampling methods; text analysis; phrase/topic discovery; probabilistic model; text mining; topic model; topic-specific bigram distribution; topic-specific unigram distribution; topical n-grams; word sampling; Artificial neural networks; Biological neural networks; Context modeling; Data mining; Information retrieval; Natural language processing; Neuroscience; Sampling methods; Text mining; Vocabulary
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3018-5
Type :
conf
DOI :
10.1109/ICDM.2007.86
Filename :
4470313