DocumentCode :
19448
Title :
Latent Semantic Analysis for Multimodal User Input With Speech and Gestures
Author :
Pui-Yu Hui; Hsiang-Yun Meng
Author_Institution :
Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong, China
Volume :
22
Issue :
2
fYear :
2014
fDate :
Feb. 2014
Firstpage :
417
Lastpage :
429
Abstract :
This paper describes our work on the semantic interpretation of a “multimodal language” combining speech and pen gestures, using latent semantic analysis (LSA). Our aim is to infer the domain-specific informational goal of multimodal inputs. The informational goal is characterized by lexical terms in the spoken modality, partial semantics of gestures in the pen modality, and term co-occurrence patterns across modalities, which we capture as “multimodal terms.” We designed and collected a multimodal corpus of navigational inquiries, for which we obtained both perfect (manual) and imperfect (automatic, via recognition) transcriptions. Parsed spoken locative references (SLRs) are automatically aligned with their corresponding pen gesture(s) by Viterbi alignment, based on their numeric and location-type features. Each cross-modal integration pattern is then characterized as a 3-tuple multimodal term comprising the SLR, the pen gesture type, and their temporal relationship. We apply LSA to derive the latent semantics from the manual and automatic transcriptions of the collected multimodal inputs: both multimodal and lexical terms are used to compose an inquiry-term matrix, which is factorized by singular value decomposition (SVD) to derive the latent semantics automatically. Informational goal inference based on these latent semantics achieves 99% accuracy on a disjoint test set with a perfect projection model and 84% with an imperfect one, outperforming the vector-space model (VSM) baseline by at least 9.9% absolute.
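Code Sketch :
The abstract's pipeline (inquiry-term matrix, truncated SVD, folding a new inquiry into the latent space, goal inference) can be illustrated with a minimal sketch. The toy matrix, the goal labels, and the nearest-neighbor cosine classifier below are hypothetical stand-ins for illustration only, not the authors' corpus, term weighting, or inference model.

```python
# Minimal LSA sketch over a toy inquiry-term matrix; all data are hypothetical.
import numpy as np

# Rows = terms (lexical terms and 3-tuple "multimodal" terms),
# columns = training inquiries; entry (i, j) = weight of term i in inquiry j.
A = np.array([
    [2, 0, 1, 0],   # lexical term, e.g. "how"
    [1, 1, 0, 0],   # lexical term, e.g. "get"
    [0, 2, 0, 1],   # multimodal term, e.g. (SLR, point gesture, overlap)
    [0, 0, 2, 1],   # multimodal term, e.g. (SLR, circle gesture, precede)
], dtype=float)
goals = ["route", "route", "info", "info"]   # informational goal of each training inquiry

# Latent semantics via truncated SVD: A ~= U_k @ diag(s_k) @ Vt_k.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def project(q):
    """Fold a new inquiry's term vector into the k-dimensional latent space."""
    return np.diag(1.0 / s_k) @ U_k.T @ q

def infer_goal(q):
    """Assign the goal of the most similar training inquiry (cosine in latent space)."""
    q_hat = project(q)
    train = Vt_k.T                       # latent coordinates of training inquiries
    sims = train @ q_hat / (np.linalg.norm(train, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return goals[int(np.argmax(sims))]

# A new (possibly imperfectly transcribed) inquiry over the same term vocabulary.
query = np.array([0.0, 1.0, 1.0, 0.0])
print(infer_goal(query))                 # prints the inferred goal label for this toy query
```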
Keywords :
gesture recognition; singular value decomposition; speech recognition; user interfaces; 3-tuple multimodal term; LSA; SLR; SVD; VSM; Viterbi alignment; automatic transcription; collected multimodal inputs; cross-modal integration pattern; domain-specific informational goal; imperfect projection model; inquiry-term matrix; latent semantic analysis; lexical terms; manual transcription; multimodal corpus; multimodal language; multimodal terms; multimodal user input; navigational inquiries; parsed spoken locative references; partial gesture semantics; pen modality; perfect projection model; semantic interpretation; singular value decomposition; speech; spoken modality; temporal relationship; term co-occurrence patterns; Educational institutions; Manuals; Semantics; Speech; Speech processing; Speech recognition; Training; Multimodal user interfaces; gesture recognition; latent semantic analysis; speech recognition;
fLanguage :
English
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2013.2294586
Filename :
6680708