Title :
Speaker association with signal-level audiovisual fusion
Author :
Fisher, John W., III ; Darrell, Trevor
Author_Institution :
Comput. Sci. & Artificial Intelligence Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
fDate :
6/1/2004
Abstract :
Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information-theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques are used to characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and to discount errant motion or audio from other utterances or nonspeech events.
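The pairwise comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each modality has already been reduced to a one-dimensional feature per frame (any feature extraction or projection step is outside this sketch), estimates mutual information with a leave-one-out Gaussian Parzen-window density estimate, and selects the video track whose estimate against the audio is highest. The function names, the bandwidth rule, and the 1-D feature assumption are hypothetical choices made for illustration only.

```python
import numpy as np


def _loo_log_density(z, h):
    """Leave-one-out log density of each row of z (shape (n, d)) under a
    Gaussian product-kernel Parzen estimate with per-dimension bandwidths h."""
    n, d = z.shape
    diff = (z[:, None, :] - z[None, :, :]) / h      # pairwise scaled differences
    kern = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(kern, 0.0)                     # exclude each point from its own estimate
    norm = (n - 1) * (2.0 * np.pi) ** (d / 2.0) * np.prod(h)
    return np.log(kern.sum(axis=1) / norm + np.finfo(float).tiny)


def mutual_information(x, y):
    """Nonparametric estimate (in nats) of I(X;Y) for two equal-length 1-D signals.
    Assumption for this sketch: both signals have nonzero variance."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("signals must have the same number of frames")
    n = x.size
    x = (x - x.mean()) / x.std()                    # standardize so one bandwidth fits both
    y = (y - y.mean()) / y.std()
    h = 1.06 * n ** (-0.2)                          # rule-of-thumb bandwidth (an assumption)
    log_px = _loo_log_density(x[:, None], np.array([h]))
    log_py = _loo_log_density(y[:, None], np.array([h]))
    log_pxy = _loo_log_density(np.column_stack([x, y]), np.array([h, h]))
    return float(np.mean(log_pxy - log_px - log_py))


def associate_speaker(audio, video_tracks):
    """Return the index of the video track with the highest estimated mutual
    information with the audio, together with all scores."""
    scores = [mutual_information(audio, v) for v in video_tracks]
    return int(np.argmax(scores)), scores
```

The estimate is in nats. For independent signals it should be close to zero, and it can come out slightly negative because of estimator noise, so only the relative ordering of the scores is used to choose the speaker.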
Keywords :
audio signal processing; image sequences; interactive systems; probability; speech recognition; statistical analysis; video signal processing; audio signals; cross-modal correspondence; discount errant motion; mutual information theoretic measure; nonparametric statistical density modeling techniques; nonspeech events; probabilistic multimodal generation model; signal-level audiovisual fusion; speaker data association; visual signals; Computer science; Databases; Fusion power generation; Microphones; Mutual information; Signal detection; Signal processing; Speech recognition; Telephone sets; Telephony; Audiovisual correspondence; multimodal data association; mutual information;
Journal_Title :
Multimedia, IEEE Transactions on
DOI :
10.1109/TMM.2004.827503