Title :
Visual speech feature extraction for improved speech recognition
Author :
Zhang, X. ; Mersereau, R.M. ; Clements, M. ; Broun, C.C.
Author_Institution :
Center for Signal & Image Processing, Georgia Institute of Technology, Atlanta, 30332-0250, USA
Abstract :
Mainstream automatic speech recognition has focused almost exclusively on the acoustic signal. The performance of such systems degrades considerably in the real world in the presence of noise. On the other hand, most human listeners, both hearing-impaired and normal-hearing, make use of visual information to improve speech perception in acoustically hostile environments. Motivated by humans' ability to lipread, the visual component is considered to yield information that is not always present in the acoustic signal and to enable improved accuracy over purely acoustic systems, especially in noisy environments. In this paper, we investigate the usefulness of visual information in speech recognition. We first present a method for automatically locating and extracting visual speech features from a talking person in color video sequences. We then develop a recognition engine to train on and recognize sequences of visual parameters for the purpose of speech recognition. We particularly explore the impact of various combinations of visual features on recognition accuracy. We conclude that inner lip contour features, together with information about the visibility of the tongue and teeth, significantly improve performance over outer-contour-only features in both speaker-dependent and speaker-independent recognition tasks.
Keywords :
Feature extraction; Hidden Markov models;
Conference_Titel :
2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Conference_Location :
Orlando, FL, USA
Print_ISBN :
0-7803-7402-9
DOI :
10.1109/ICASSP.2002.5745022