DocumentCode
13136
Title
Joint Video and Text Parsing for Understanding Events and Answering Queries
Author
Kewei Tu ; Meng Meng ; Mun Wai Lee ; Tae Eun Choe ; Song-Chun Zhu
Author_Institution
ShanghaiTech Univ., Shanghai, China
Volume
21
Issue
2
fYear
2014
fDate
Apr.-June 2014
Firstpage
42
Lastpage
70
Abstract
This article proposes a multimedia analysis framework that processes video and text jointly to understand events and answer user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probability distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on this model, they propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module then produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries of who, what, when, where, and why. The authors empirically evaluated the system against ground truth and on the accuracy of query answering, and obtained satisfactory results.
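For illustration only, the three-module pipeline described in the abstract (video parsing and text parsing producing two parse graphs, followed by joint inference via matching, deduction, and revision) could be sketched roughly as below. This is a minimal Python sketch under assumed data structures; ParseGraph, match, deduce, and revise are hypothetical names for this sketch, not the authors' implementation, and the deduction and revision steps are left as placeholders.

from dataclasses import dataclass, field
from typing import Set, Tuple

@dataclass
class ParseGraph:
    """Entities (objects, scenes, actions, events, fluents) plus
    spatial/temporal/causal relations between them."""
    nodes: Set[str] = field(default_factory=set)
    # Relations stored as (head, relation_label, tail) triples.
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

def match(video_pg: ParseGraph, text_pg: ParseGraph) -> ParseGraph:
    """Matching: align entities found in both modalities and merge the two
    parse graphs (here simplified to a union of identically named nodes)."""
    return ParseGraph(video_pg.nodes | text_pg.nodes,
                      video_pg.relations | text_pg.relations)

def deduce(joint_pg: ParseGraph, stc_aog) -> ParseGraph:
    """Deduction: use the S/T/C-AOG prior to add entities and relations that
    are implied but not directly observed (placeholder)."""
    return joint_pg

def revise(joint_pg: ParseGraph, stc_aog) -> ParseGraph:
    """Revision: correct or remove low-probability parts that conflict across
    modalities, maximizing posterior probability (placeholder)."""
    return joint_pg

if __name__ == "__main__":
    video_pg = ParseGraph({"person_1", "event:pickup"},
                          {("person_1", "agent_of", "event:pickup")})
    text_pg = ParseGraph({"person_1", "event:pickup", "object:bag"},
                         {("event:pickup", "patient", "object:bag")})
    joint_pg = revise(deduce(match(video_pg, text_pg), None), None)
    print(joint_pg.nodes)
    print(joint_pg.relations)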
Keywords
graph theory; inference mechanisms; knowledge representation; query processing; statistical distributions; text analysis; video signal processing; bag-of-words approach; causal information; events understanding; joint inference; knowledge representation; multimedia analysis framework; narrative text descriptions; parse graph; prior probabilistic distribution; probabilistic generative model; spatial information; spatial-temporal-causal AND-OR graph; temporal information; text parsing; text processing; user query answering; video parsing; video processing; Computational modeling; Computer vision; Multimedia communication; Probabilistic logic; Semantics; Streaming media; Text recognition; AND-OR graph; joint video and text parsing; knowledge representation; multimedia; multimedia video analysis; query answering
fLanguage
English
Journal_Title
IEEE MultiMedia
Publisher
IEEE
ISSN
1070-986X
Type
jour
DOI
10.1109/MMUL.2014.29
Filename
6818956
Link To Document