Regional Information Center for Science and Technology - Encoding Navigable Speech Sources: A Psychoacoustic-Based Analysis-by-Synthesis Approach

DocumentCode :

28860

Title :

Encoding Navigable Speech Sources: A Psychoacoustic-Based Analysis-by-Synthesis Approach

Author :

Zheng, Xiguang ; Ritz, Christian ; Xi, Jiangtao

Author_Institution :

ICT Res. Inst. & Sch. of Electr. Comput. & Telecommun. Eng., Univ. of Wollongong, Wollongong, NSW, Australia

Volume :

Issue :

fYear :

2013

fDate :

Jan. 2013

Firstpage :

Lastpage :

Abstract :

This paper presents a psychoacoustic-based analysis-by-synthesis approach for compressing navigable speech sources. The approach targets multi-party teleconferencing applications, where selective reproduction of individual speech sources is desired. By exploiting the sparsity of speech in the perceptual time-frequency domain, multiple speech signals are encoded into one mono mixture signal, which can be further compressed using a standard speech codec. Side information indicating the active speech source for each time-frequency instant enables flexible decoding and reproduction. Objective results highlight the importance of considering perception when exploiting the sparse nature of speech in the time-frequency domain. Results show that this sparsity, as measured by the preserved energy level of perceptually important time-frequency components extracted from mixtures of speech signals, is similar in both anechoic and reverberant environments. The proposed approach is applied to a series of simulated and real reverberant speech recordings, where the resulting speech mixtures are compressed using a standard speech codec operating at 32 kbps. The perceptual quality, as judged by both objective and subjective evaluations, exceeds that of a simple sparsity approach that does not consider perception and that of an approach that encodes each source separately. While the perceptual quality of individual speech sources is maintained, subjective tests also confirm that the approach maintains the perceptual quality of the spatialized speech scene.
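
The mixing idea described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's psychoacoustic analysis-by-synthesis method. It only illustrates the simpler, non-perceptual sparsity baseline the abstract compares against: for each time-frequency instant, keep the source with the highest energy, store its index as side information, and use that index at the decoder to recover an individual source from the mono mixture. The function names, the array layout `(n_sources, n_frames, n_bins)`, and the energy-ratio measure are assumptions made for illustration.

```python
import numpy as np


def encode_dominant(tf_sources):
    """Mix several sources into one time-frequency (TF) mixture.

    tf_sources: complex array, shape (n_sources, n_frames, n_bins)
                (assumed layout; any TF transform could produce it).
    Returns the mono TF mixture and the per-instant active source index.
    """
    energy = np.abs(tf_sources) ** 2
    # Side information: index of the highest-energy source at each TF instant.
    active = np.argmax(energy, axis=0)
    # Keep only the dominant source's coefficient at each TF instant.
    mixture = np.take_along_axis(tf_sources, active[None, ...], axis=0)[0]
    return mixture, active


def decode_source(mixture, active, index):
    """Recover one source by keeping the TF instants assigned to it."""
    return np.where(active == index, mixture, 0)


def preserved_energy(tf_sources, active):
    """Fraction of the total source energy retained by the mixture."""
    energy = np.abs(tf_sources) ** 2
    kept = np.take_along_axis(energy, active[None, ...], axis=0)[0]
    return kept.sum() / energy.sum()
```

Calling `encode_dominant` on the stacked TF representations of the talkers yields the mixture and the side information. `decode_source(mixture, active, k)` then returns the TF instants attributed to talker `k`, and `preserved_energy` gives a rough, non-perceptual measure of how much source energy survives the mixing.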

Keywords :

data compression; speech codecs; speech coding; speech synthesis; teleconferencing; time-frequency analysis; flexible decoding; flexible reproduction; anechoic environments; encoding navigable speech sources; energy level; mono mixture signal; multiparty teleconferencing applications; navigable speech source compression; perceptual quality; psychoacoustic-based analysis-by-synthesis approach; real reverberant speech recordings; reverberant environments; standard speech codec; time-frequency domain; Audio coding; Reverberation; Speech; Speech coding; Teleconferencing; Time frequency analysis; Multichannel speech coding; soundfield navigation; spatial teleconferencing

fLanguage :

English

Journal_Title :

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher :

IEEE

ISSN :

1558-7916

Type :

jour

DOI :

10.1109/TASL.2012.2211015

Filename :

6256702

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=28860