Title :
Identifying Sentence-Level Semantic Content Units with Topic Models
Author :
Hennig, Leonhard ; Strecker, Thomas ; Narr, Sascha ; De Luca, Ernesto William ; Albayrak, Sahin
Author_Institution :
Distrib. Artificial Intell. Lab. (DAI-Lab.), Tech. Univ., Berlin, Germany
fDate :
Aug. 30 2010-Sept. 3 2010
Abstract :
Statistical approaches to document content modeling typically focus either on broad topics or on discourse-level subtopics of a text. We present an analysis of the performance of probabilistic topic models on the task of learning sentence-level topics that are similar to facts. The identification of sentential content with the same meaning is an important task in multi-document summarization and the evaluation of multi-document summaries. In our approach, each sentence is represented as a distribution over topics, and each topic is a distribution over words. We compare the topic-sentence assignments discovered by a topic model to gold-standard assignments that were manually annotated on a set of closely related pairs of news articles. We observe a clear correspondence between automatically identified and annotated topics. The high accuracy of automatically discovered topic-sentence assignments suggests that topic models can be utilized to identify (sub-)sentential semantic content units.
Keywords :
content management; data mining; text analysis; document content modeling; gold standard assignment; multidocument summarization; probabilistic topic model; sentence level semantic content units identification; sentence level topics learning; Analytical models; Humans; Petroleum; Probabilistic logic; Resource management; Semantics; Storage tanks; latent Dirichlet allocation; text summarization; topic models
Conference_Title :
Database and Expert Systems Applications (DEXA), 2010 Workshop on
Conference_Location :
Bilbao
Print_ISBN :
978-1-4244-8049-4
DOI :
10.1109/DEXA.2010.33