Title :
Robust Multimodal Person Identification With Limited Training Data
Author :
McLaughlin, N. ; Ming, J. ; Crookes, D.
Author_Institution :
Sch. of Electron., Electr. Eng. & Comput. Sci., Queen's Univ. Belfast, Belfast, UK
fDate :
3/1/2013
Abstract :
This paper presents a novel method of audio-visual feature-level fusion for person identification where both the speech and facial modalities may be corrupted, and there is no prior knowledge about the corruption. Furthermore, we assume there is a limited amount of training data for each modality (e.g., a short training speech segment and a single training facial image per person). A new multimodal feature representation and a modified cosine similarity are introduced to combine and compare bimodal features with limited training data, as well as vastly differing data rates and feature sizes. Optimal feature selection and multicondition training are used to reduce the mismatch between training and testing, thereby making the system robust to unknown bimodal corruption. Experiments have been carried out on a bimodal dataset created from the SPIDRE speaker recognition database and the AR face recognition database, with variable noise corruption of the speech and occlusion in the face images. The system's speaker identification performance on the SPIDRE database and facial identification performance on the AR database are comparable with the literature. Combining both modalities using the new method of multimodal fusion leads to significantly improved accuracy over the unimodal systems, even when both modalities have been corrupted. The new method also shows improved identification accuracy compared with bimodal systems based on multicondition model training or missing-feature decoding alone.
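The abstract does not detail how the modified cosine similarity combines the two modalities, so the following is only a minimal sketch of the general idea: scoring each modality with a standard cosine similarity against its enrollment template and combining the scores with a weight. All function names and the weighting scheme here are illustrative assumptions, not the authors' method.

```python
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(test_speech, test_face, enroll_speech, enroll_face, w=0.5):
    # Hypothetical weighted fusion of per-modality cosine scores;
    # the paper instead fuses at the feature level with a modified
    # cosine similarity, whose details are not given in the abstract.
    s_audio = cosine_similarity(test_speech, enroll_speech)
    s_face = cosine_similarity(test_face, enroll_face)
    return w * s_audio + (1.0 - w) * s_face

# Identify a test sample as the enrolled person with the highest fused score.
```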
Keywords :
biometrics (access control); computer graphics; face recognition; feature extraction; image denoising; image fusion; image representation; image segmentation; speaker recognition; AR face recognition database; SPIDRE speaker recognition database; audio-visual feature-level fusion method; bimodal dataset; data rates; facial identification performance; facial modalities; feature sizes; limited training data; mismatch reduction; modified cosine similarity; multicondition training; multimodal feature representation; optimal feature selection; robust multimodal person identification; short training speech segment; single training facial image; speaker identification performance; speech modalities; unknown bimodal corruption; variable noise corruption; Face; Noise; Robustness; Speech; Speech recognition; Training; Training data; Limited training data; multimodal fusion; noisy speech; occluded face; person identification; robustness;
Journal_Title :
IEEE Transactions on Human-Machine Systems
DOI :
10.1109/TSMCC.2012.2227959