Multimodal object recognition from visual and audio sequences

Author

Weipeng He;Haojun Guan;Jianwei Zhang

Author_Institution

TAMS, Department of Informatics, University of Hamburg, Vogt-Kö

fYear

2015

Firstpage

133

Lastpage

138

Abstract

This paper describes a visual-audio object recognition system based on hidden Markov models (HMMs). The visual feature is a bag-of-words representation built from scale-invariant feature transform (SIFT) descriptors, and the audio feature is the mel-frequency cepstral coefficients (MFCCs). Objects are classified by computing the probabilities of the observed sequences under the learned HMMs. Two fusion methods are used in the system: feature fusion and decision fusion. The former learns a joint probability distribution with a single HMM, while the latter learns a separate distribution for each modality and combines them under the conditional independence assumption. Experiments on a dataset of 33 different household objects evaluate the performance of the two fusion methods as well as the unimodal approaches. The results show that both fusion methods outperform the unimodal methods, while the two fusion methods are mostly comparable to each other.
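The decision-fusion step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes that each modality already yields a per-class log-likelihood (for example, from one HMM per class and modality), and that conditional independence lets the two be summed in the log domain, i.e. log p(v, a | c) = log p(v | c) + log p(a | c). The function name `decision_fusion`, the class labels, and all numeric values are hypothetical and chosen only for illustration.

```python
def decision_fusion(visual_loglik, audio_loglik, log_prior=None):
    """Fuse per-class log-likelihoods from two modalities.

    visual_loglik, audio_loglik: dicts mapping class label -> log-likelihood
    of the observed sequence under that class's per-modality model.
    log_prior: optional dict of log class priors (uniform if omitted).
    Returns (best_label, fused_scores).
    """
    classes = visual_loglik.keys() & audio_loglik.keys()
    if not classes:
        raise ValueError("the two modalities share no class labels")

    fused_scores = {}
    for c in classes:
        prior = 0.0 if log_prior is None else log_prior[c]
        # Conditional independence given the class: the joint likelihood
        # factorizes, so the log-likelihoods add.
        fused_scores[c] = visual_loglik[c] + audio_loglik[c] + prior

    best_label = max(fused_scores, key=fused_scores.get)
    return best_label, fused_scores


# Made-up log-likelihoods for three hypothetical object classes.
visual = {"cup": -120.0, "keys": -118.0, "bottle": -130.0}
audio = {"cup": -80.0, "keys": -95.0, "bottle": -85.0}

label, scores = decision_fusion(visual, audio)
print(label, scores[label])
```

In this toy example the visual scores alone would favor "keys", but the audio evidence shifts the fused decision to "cup", which is the kind of complementary effect that motivates combining modalities. Feature fusion, by contrast, would concatenate the two feature streams and train a single model per class on the joint observations.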

Keywords

"Hidden Markov models","Visualization","Object recognition","Joints","Feature extraction","Videos","Covariance matrices"

Publisher

IEEE

Conference_Titel

Multisensor Fusion and Integration for Intelligent Systems (MFI), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/MFI.2015.7295798

Filename

7295798