• DocumentCode
    178702
  • Title
    Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems
  • Author
    Arsikere, Harish; Shriberg, Elizabeth; Ozertem, Umut
  • Author_Institution
    Electrical Engineering Department, University of California, Los Angeles, Los Angeles, CA, USA
  • fYear
    2014
  • fDate
    4-9 May 2014
  • Firstpage
    3241
  • Lastpage
    3245
  • Abstract
    Current speech-input systems typically use a nonspeech threshold for end-of-utterance detection. While usually sufficient for short utterances, the approach can cut speakers off during pauses in more complex utterances. We elicit personal-assistant speech (reminders, calendar entries, messaging, search) using a recognizer with a dramatically increased endpoint threshold, and find frequent nonfinal pauses. A standard endpointer with a 500 ms threshold (latency) results in a 36% cutoff rate for this corpus. Based on the new data, we develop low-cost acoustic features to discriminate nonfinal from final pauses. Features capture periodicity, speaking rate, spectral constancy, duration/intensity, and pitch of prepausal speech, and use no speech recognition, speaker, or session information. Classification experiments yield 20% EER at a 100 ms latency, thereby reducing both cutoffs and latency compared with the threshold-only baseline. Additional results on computational cost, feature importance, and speaker differences are discussed.
  • Keywords
    feature extraction; natural language interfaces; natural language processing; speaker recognition; speech-based user interfaces; EER; computationally-efficient endpointing features; end-of-utterance detection; endpoint threshold; feature importance; natural spoken interaction; nonspeech threshold; personal-assistant speech; personal-assistant systems; prepausal speech; session information; speaker differences; speaker information; speech recognition; speech-input systems; time 100 ms; time 500 ms; Databases; Feature extraction; Market research; Modulation; Speech; Speech recognition; Standards; acoustic-prosodic features; computationally efficient; endpointing; pausing; personal assistants;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • Conference_Location
    Florence, Italy
  • Type
    conf
  • DOI
    10.1109/ICASSP.2014.6854199
  • Filename
    6854199