• DocumentCode
    81019
  • Title

    TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech

  • Author

    Harte, Naomi ; Gillen, Eoin

  • Author_Institution
    Dept. of Electron. & Electr. Eng., Trinity Coll. Dublin, Dublin, Ireland
  • Volume
    17
  • Issue
    5
  • fYear
    2015
  • fDate
    May-15
  • Firstpage
    603
  • Lastpage
    615
  • Abstract
    Automatic audio-visual speech recognition currently lags behind its audio-only counterpart in terms of major progress. One of the reasons commonly cited by researchers is the scarcity of suitable research corpora. This paper details the creation of a new corpus designed for continuous audio-visual speech recognition research. TCD-TIMIT consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences. Three of the speakers are professionally-trained lipspeakers, recorded to test the hypothesis that lipspeakers may have an advantage over regular speakers in automatic visual speech recognition systems. Video footage was recorded from two angles: straight on, and at 30°. The paper outlines the recording of footage, and the required post-processing to yield video and audio clips for each sentence. Audio, visual, and joint audio-visual baseline experiments are reported. Separate experiments were run on the lipspeaker and non-lipspeaker data, and the results compared. Visual and audio-visual baseline results on the non-lipspeakers were low overall. Results on the lipspeakers were found to be significantly higher. It is hoped that as a publicly available database, TCD-TIMIT will now help further the state of the art in audio-visual speech recognition research.
  • Keywords
    audio-visual systems; speech recognition; video signal processing; TCD-TIMIT; audio clips; audio-visual baseline experiments; audio-visual corpus; automatic visual speech recognition systems; continuous audio-visual speech recognition research; continuous speech; high-quality audio footage; high-quality video footage; professionally-trained lipspeakers; video clips; Cameras; Dictionaries; Speech; Speech recognition; Visual databases; Visualization; Audio-visual speech recognition;
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type

    jour

  • DOI
    10.1109/TMM.2015.2407694
  • Filename
    7050271