A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

Author

Das, Pradipto; Xu, Chenliang; Doell, Richard F.; Corso, Jason J.

Author_Institution

Comput. Sci. & Eng., SUNY at Buffalo, Buffalo, NY, USA

fYear

2013

fDate

23-28 June 2013

Firstpage

2634

Lastpage

2641

Abstract

The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have focused either on a top-down approach that generates language by combining object detections with language models, or on bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest-neighbor techniques. In contrast, describing videos with natural language is a less-studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low-level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors, and a high-level module that produces the final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that the final system output has greater agreement with the human descriptions than the output of any single level.

Keywords

natural language processing; object detection; video signal processing; computer vision; concept detectors; human descriptions; image description; initial keyword annotation; language models; low level multimodal latent topic model; natural language description; sparse object stitching; video description; videos lingual description; Detectors; Natural languages; Predictive models; Semantics; Training; Videos; Visualization; multimodal topic model; natural language; video to text; video understanding

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on

Conference_Location

Portland, OR

ISSN

1063-6919

Type

conf

DOI

10.1109/CVPR.2013.340

Filename

6619184