Regional Information Center for Science and Technology - A framework of human-based speech transcription with a speech chunking front-end

Abstract:

This paper presents a framework for "human-based" speech transcription in a crowdsourcing environment. The main purpose of the framework is to encourage a large population of volunteers to participate in speech transcription, in order to create caption data for hearing-impaired people. It allows volunteers to join the transcription task by working on a very short segment of speech, referred to here as a "speech chunk". This is realized by incorporating a speech chunking front-end ahead of the main transcription task. The front-end is intended to make the allocation of transcription tasks to participants more flexible and, more importantly, to reduce the burden of the task itself: audio data is cut in advance into utterances of appropriate length, which eases the repetitive playback operations that transcription requires. As an initial study, the performance of the speech chunking is investigated for various types of content, in terms of how appropriately speech chunks are extracted as units of transcription work. The results show that the framework can be applied even to animated video content, which usually includes dynamic sound effects.
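The abstract does not say how the front-end finds chunk boundaries, so the following is only an illustrative sketch of one common approach, not the authors' method. It assumes per-frame energy values have already been computed and splits wherever the energy stays below a threshold for long enough. The function name `chunk_by_pauses` and the parameters `energy_threshold`, `min_pause_frames`, and `max_chunk_frames` are hypothetical names introduced for this example.

```python
from typing import List, Tuple


def chunk_by_pauses(
    frame_energies: List[float],
    energy_threshold: float = 0.5,   # assumed: frames at or above this count as speech
    min_pause_frames: int = 2,       # assumed: silent frames needed to close a chunk
    max_chunk_frames: int = 100,     # assumed: upper bound on chunk length
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech chunks; end is exclusive."""
    chunks: List[Tuple[int, int]] = []
    start = None        # first frame of the chunk being built, if any
    last_voiced = -1    # most recent frame classified as speech
    silent_run = 0      # consecutive silent frames inside the current chunk

    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # A long enough pause: close the chunk at the last speech frame.
                chunks.append((start, last_voiced + 1))
                start = None
                silent_run = 0

        # Keep each task short by forcing a split on over-long chunks.
        if start is not None and i + 1 - start >= max_chunk_frames:
            chunks.append((start, last_voiced + 1))
            start = None
            silent_run = 0

    if start is not None:
        chunks.append((start, last_voiced + 1))
    return chunks
```

Each returned span could then be offered to a volunteer as a single unit of work. A plain energy threshold is likely to mistake loud sound effects for speech, so content such as animated video would probably need a more robust speech detector than this sketch.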