• DocumentCode
    80609
  • Title
    Cross-Lingual Automatic Speech Recognition Using Tandem Features
  • Author
    Lal, Pyare; King, Simon
  • Author_Institution
    Sch. of Inf., Univ. of Edinburgh, Edinburgh, UK
  • Volume
    21
  • Issue
    12
  • fYear
    2013
  • fDate
    Dec. 2013
  • Firstpage
    2506
  • Lastpage
    2515
  • Abstract
    Automatic speech recognition depends on large amounts of transcribed speech recordings in order to estimate the parameters of the acoustic model. Recording such large speech corpora is time-consuming and expensive; as a result, sufficient quantities of data exist only for a handful of languages, and there are many more languages for which little or no data exist. Given that there are acoustic similarities between speech in different languages, it may be fruitful to use data from a well-resourced source language to estimate the acoustic models for a recognizer in a poorly-resourced target language. Previous approaches to this task have often involved making assumptions about shared phonetic inventories between the languages. Unfortunately, pairs of languages do not generally share a common phonetic inventory. We propose an indirect way of transferring information from a source language acoustic model to a target language acoustic model without having to make any assumptions about the phonetic inventory overlap. To do this, we employ tandem features, in which class-posteriors from a separate classifier are decorrelated and appended to conventional acoustic features. Tandem features have the advantage that the language of the speech data used to train the classifier need not be the same as the target language to be recognized. This is because the class-posteriors are not used directly, so do not have to be over any particular set of classes. We demonstrate the use of tandem features in cross-lingual settings, including training on one or several source languages. We also examine factors which may predict a priori how much relative improvement will be brought about by using such tandem features, for a given source and target pair. In addition to conventional phoneme class-posteriors, we also investigate whether articulatory features (AFs), a multi-stream, discrete, multi-valued labeling of speech, can be used instead. This is motivated by an assumption that AFs are less language-specific than a phoneme set.
  • Keywords
    natural language processing; speech recognition; articulatory feature; conventional phoneme class-posteriors; cross-lingual automatic speech recognition; discrete speech labeling; language acoustic model; multistream speech labeling; multivalued speech labeling; speech corpora; speech data; tandem feature; Data models; Hidden Markov models; Perceptrons; Speech recognition; Automatic speech recognition; multilayer perceptrons;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2013.2277932
  • Filename
    6578128