Title :
Modeling multiword phrases with constrained phrase trees for improved topic modeling of conversational speech
Author :
Hazen, Timothy J. ; Richardson, F.
Author_Institution :
MIT Lincoln Lab., Lexington, MA, USA
Abstract :
Latent topic modeling has proven to be an effective means for learning the underlying semantic content within document collections. Latent topic modeling has traditionally been applied to bag-of-words representations that ignore word sequence information that can aid in semantic understanding. In this work we introduce a method for efficiently incorporating arbitrarily long word sequences into a topic modeling approach. This method iteratively constructs a constrained set of phrase trees in an unsupervised fashion from a document collection using weighted pointwise mutual information statistics to guide the process. In experiments on the Fisher Corpus of conversational speech, the incorporation of learned phrases into a latent topic model yielded significant improvements in the unsupervised discovery of the known topics present within the data.
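The abstract does not give the exact statistic, so the following is only a minimal sketch of how weighted pointwise mutual information might be used to rank candidate two-word phrases. It assumes the common definition wPMI(x, y) = p(x, y) * log(p(x, y) / (p(x) p(y))), whitespace tokenization, and adjacent word pairs only. The function name `weighted_pmi` and the toy documents are hypothetical and are not taken from the paper, and the paper's iterative phrase-tree construction is not reproduced here.

```python
import math
from collections import Counter


def weighted_pmi(docs):
    """Score adjacent word pairs by p(x,y) * log(p(x,y) / (p(x) * p(y))).

    Assumption: this is the textbook weighted PMI, not necessarily the
    exact statistic used by the authors.
    """
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # assumed tokenization: whitespace
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = p_xy * math.log(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy collection, for illustration only.
docs = [
    "we talked about minimum wage laws",
    "the minimum wage went up",
    "i think the minimum wage is too low",
]
scores = weighted_pmi(docs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

Multiplying the PMI by the joint probability favors pairs that are both strongly associated and frequent, so a pair seen only once does not outrank a recurring phrase. Extending a selected pair with further words, one at a time, would be one way to approach longer phrases, though that step is an assumption and is not sketched here.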
Keywords :
document handling; iterative methods; speech processing; Fisher Corpus; bag-of-words representations; constrained phrase trees; conversational speech; document collections; improved topic modeling; modeling multiword phrases; phrase trees; semantic content; weighted pointwise mutual information statistics; Data models; Drugs; Mutual information; Semantics; Standards; Training; Vocabulary; phrases; topic modeling
Conference_Title :
Spoken Language Technology Workshop (SLT), 2012 IEEE
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4673-5125-6
Electronic_ISBN :
978-1-4673-5124-9
DOI :
10.1109/SLT.2012.6424226