Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems

Author

Arsikere, Harish ; Shriberg, Elizabeth ; Ozertem, Umut

Author_Institution

Electr. Eng. Dept., Univ. of California, Los Angeles, Los Angeles, CA, USA

fYear

2014

fDate

4-9 May 2014

Firstpage

3241

Lastpage

3245

Abstract

Current speech-input systems typically use a nonspeech threshold for end-of-utterance detection. While usually sufficient for short utterances, the approach can cut speakers off during pauses in more complex utterances. We elicit personal-assistant speech (reminders, calendar entries, messaging, search) using a recognizer with a dramatically increased endpoint threshold, and find frequent nonfinal pauses. A standard endpointer with a 500 ms threshold (latency) results in a 36% cutoff rate for this corpus. Based on the new data, we develop low-cost acoustic features to discriminate nonfinal from final pauses. Features capture periodicity, speaking rate, spectral constancy, duration/intensity, and pitch of prepausal speech, using no speech recognition, speaker, or session information. Classification experiments yield a 20% equal error rate (EER) at a 100 ms latency, thereby reducing both cutoffs and latency compared with the threshold-only baseline. Additional results on computational cost, feature importance, and speaker differences are discussed.
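
The threshold-only baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 10 ms frame step, the function names, and the example utterance are assumptions made for illustration only. The sketch shows how a fixed nonspeech threshold fires inside a nonfinal pause that is longer than the threshold, and how raising the threshold avoids the cutoff at the cost of added latency.

```python
from typing import Optional, Sequence

FRAME_MS = 10  # assumed frame step; the abstract does not state one


def threshold_endpoint(is_speech: Sequence[bool], threshold_ms: int,
                       frame_ms: int = FRAME_MS) -> Optional[int]:
    """Return the time (ms) at which a nonspeech-threshold endpointer fires.

    The endpointer fires once some speech has been seen and the current run
    of nonspeech frames reaches threshold_ms. Returns None if it never fires.
    """
    needed = threshold_ms // frame_ms  # nonspeech frames required to fire
    seen_speech = False
    run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= needed:
                return (i + 1) * frame_ms
    return None


def frames(*segments):
    """Build a per-frame speech/nonspeech list from (is_speech, duration_ms) pairs."""
    out = []
    for speech, dur_ms in segments:
        out.extend([speech] * (dur_ms // FRAME_MS))
    return out


# Hypothetical utterance: 1 s speech, 600 ms nonfinal pause, 1 s speech, 1 s silence.
utterance = frames((True, 1000), (False, 600), (True, 1000), (False, 1000))

cut = threshold_endpoint(utterance, threshold_ms=500)  # fires inside the pause
ok = threshold_endpoint(utterance, threshold_ms=700)   # waits for the true end
print(cut, ok)
```

In this made-up example the 500 ms threshold fires at 1500 ms, while the speaker is still pausing, whereas the 700 ms threshold fires at 3300 ms, 700 ms after speech actually ends. The features described in the abstract are aimed at avoiding this trade-off by classifying the pause itself rather than only waiting it out.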

Keywords

feature extraction; natural language interfaces; natural language processing; speaker recognition; speech-based user interfaces; EER; computationally-efficient endpointing features; end-of-utterance detection; endpoint threshold; feature importance; natural spoken interaction; nonspeech threshold; personal-assistant speech; personal-assistant systems; prepausal speech; session information; speaker differences; speaker information; speech recognition; speech-input systems; time 100 ms; time 500 ms; Databases; Feature extraction; Market research; Modulation; Speech; Speech recognition; Standards; acoustic-prosodic features; computationally efficient; endpointing; pausing; personal assistants;

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on

Conference_Location

Florence

Type

conf

DOI

10.1109/ICASSP.2014.6854199

Filename

6854199