DocumentCode
3632023
Title
Turkish broadcast news transcription with open-source software
Author
Doğan Can; Murat Saraçlar
Author_Institution
Elektrik Elektronik Mühendisliği Bölümü, Boğaziçi Üniversitesi, 34342, Bebek, İstanbul, Türkiye
fYear
2009
fDate
4/1/2009 12:00:00 AM
Firstpage
325
Lastpage
328
Abstract
In this paper, we present our Turkish large vocabulary continuous speech recognition (LVCSR) system, which is built on open-source software (HTK, SRILM) and uses 187 hours of Turkish broadcast news data together with a 184-million-word text corpus collected from various Turkish news portals. Within this system, we developed three acoustic models optimizing the maximum likelihood (ML), maximum mutual information (MMI), and minimum phone error (MPE) criteria, and investigated the contribution of discriminative acoustic modeling to Turkish LVCSR. Recognition experiments using a trigram language model with a 50K-word vocabulary yield word error rates of 25.8% with ML, 24.3% with MMI, and 23.7% with MPE.
Keywords
"Open source software","Broadcasting","Vocabulary","Maximum likelihood estimation","Speech recognition","Portals","Error analysis","Mutual information","Lattices"
Publisher
IEEE
Conference_Titel
Signal Processing and Communications Applications Conference, 2009. SIU 2009. IEEE 17th
ISSN
2165-0608
Print_ISBN
978-1-4244-4435-9
Type
conf
DOI
10.1109/SIU.2009.5136398
Filename
5136398