DocumentCode :
1161131
Title :
Advances in speech transcription at IBM under the DARPA EARS program
Author :
Chen, Stanley F. ; Kingsbury, Brian ; Mangu, Lidia ; Povey, Daniel ; Saon, George ; Soltau, Hagen ; Zweig, Geoffrey
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY
Volume :
14
Issue :
5
fYear :
2006
Firstpage :
1596
Lastpage :
1608
Abstract :
This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation and produced the most accurate results on the 2004 test sets in every speed category.
Keywords :
Gaussian processes; covariance analysis; error statistics; speech coding; speech recognition; speech synthesis; DARPA EARS program; Defense Advanced Research Projects Agency; English conversational telephony test data; IBM speech recognition technology; NIST evaluations; basic decoding algorithm; effective affordable reusable speech-to-text program; error rate; feature-based minimum phone error training; large-scale discriminatively trained full-covariance Gaussian models; septaphone acoustic context; speech transcription; static decoding graphs; system building advances; technical building advances; Acoustic testing; Buildings; Context modeling; Decoding; Ear; Large-scale systems; Speech recognition; System testing; Telephony; Training data; Discriminative training; Effective Affordable Reusable Speech-to-Text (EARS); Viterbi decoding; finite-state transducer; full covariance modeling; large-vocabulary conversational speech recognition;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2006.879814
Filename :
1677980