• DocumentCode
    66892
  • Title
    Model-Based Unsupervised Spoken Term Detection with Spoken Queries
  • Author
    Chun-an Chan; Lin-Shan Lee
  • Author_Institution
    Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan
  • Volume
    21
  • Issue
    7
  • fYear
    2013
  • fDate
    July 2013
  • Firstpage
    1330
  • Lastpage
    1342
  • Abstract
    We present a set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries that require neither speech recognition nor annotated data. This work shows the possibility of migrating from dynamic time warping (DTW)-based to model-based approaches for unsupervised STD. The proposed approach consists of three components: self-organizing models, query matching, and query modeling. To construct the self-organizing models, repeated patterns are captured and modeled using acoustic segment models (ASMs). In the query matching phase, a document state matching (DSM) approach is proposed, in which documents are represented as ASM sequences and matched to the query frames. In this way, the ASMs not only model the signal distributions and time trajectories of speech better, but also represent each document with far fewer states than frames, leading to a much lower computational load. A novel duration-constrained Viterbi (DC-Vite) algorithm is further proposed for this matching process to handle the speaking-rate distortion problem. In the query modeling phase, a pseudo likelihood ratio (PLR) approach is proposed within the pseudo relevance feedback (PRF) framework: a likelihood ratio evaluated with query/anti-query hidden Markov models (HMMs), trained on pseudo-relevant/irrelevant examples, is used to verify the detected spoken term hypotheses. The proposed framework demonstrates the usefulness of ASMs for STD in zero-resource settings and the potential of an instantly responding STD system based on ASM indexing. The best performance is achieved by integrating DTW-based approaches into the rescoring steps of the proposed framework. Experimental results show an absolute improvement of 14.2% in mean average precision with a 77% reduction in CPU time compared with the segmental DTW approach on a Mandarin broadcast news corpus. Consistent improvements are found on the TIMIT and MediaEval 2011 Spoken Web Search corpora.
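    As a minimal sketch of the verification step described above (the symbols X, λ_q, and λ_q̄ are illustrative, not the paper's own notation): a detected hypothesis region X would be rescored with a log-likelihood ratio between a query HMM λ_q trained on pseudo-relevant (top-ranked) examples and an anti-query HMM λ_q̄ trained on pseudo-irrelevant ones,

        \mathrm{PLR}(X) = \log \frac{P(X \mid \lambda_q)}{P(X \mid \lambda_{\bar{q}})},

    with hypotheses kept or reranked according to this score; the exact formulation and decision thresholds are given in the paper itself.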
  • Keywords
    hidden Markov models; maximum likelihood estimation; query processing; speech recognition; unsupervised learning; ASM indexing; DC-Vite algorithm; MediaEval 2011 Spoken Web Search corpus; acoustic segment models; annotated data; document state matching; duration-constrained Viterbi algorithm; dynamic time warping; mean average precision improvement; model-based unsupervised spoken term detection; pseudolikelihood ratio approach; pseudorelevance feedback framework; pseudorelevant-irrelevant examples; query frames; query matching phase; query-antiquery HMM; repeated patterns; self-organizing models; signal distributions; speaking rate distortion problem; speech recognition; spoken queries; time trajectory; zero-resource settings; Acoustics; Data models; Hidden Markov models; Speech; Speech recognition; Trajectory; Viterbi algorithm; Acoustic segment model; dynamic time warping; unsupervised spoken term detection; zero-resource;
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Audio, Speech, and Language Processing
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2013.2248714
  • Filename
    6469170