DocumentCode :
1950526
Title :
Detecting synthetic speech using long term magnitude and phase information
Author :
Xiaohai Tian ; Steven Du ; Xiong Xiao ; Haihua Xu ; Eng Siong Chng ; Haizhou Li
Author_Institution :
School of Computer Engineering, Nanyang Technological University (NTU), Singapore
fYear :
2015
fDate :
12-15 July 2015
Firstpage :
611
Lastpage :
615
Abstract :
Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. Such signals pose a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice and deceive the SV system. To address this challenge, we study the detection of synthetic speech using long-term magnitude and phase information of speech. As most TTS and VC techniques rely on vocoders for speech analysis and synthesis, we focus on differentiating vocoder-generated speech signals from natural speech. The log magnitude spectrum and two phase-based features, instantaneous frequency derivative and modified group delay, were studied in this work. We conducted experiments on the CMU-ARCTIC database using various speech features and a neural network classifier. During training, synthetic speech detection is formulated as a 2-class classification problem and the neural network is trained to differentiate synthetic speech from natural speech. During testing, the posterior scores generated by the neural network are used to detect synthetic speech. The synthetic speech used in training and testing is generated by different types of vocoders and VC methods. Experimental results show that long-term information up to 0.3 s is important for synthetic speech detection. In addition, the high-dimensional log magnitude spectrum features significantly outperform the low-dimensional MFCC features, showing that it is important to retain detailed spectral information for detecting synthetic speech. Furthermore, the two phase-based features are found to perform well and to be complementary to the log magnitude spectrum features. The fusion of these features produces an equal error rate (EER) of 0.09%.
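The pipeline the abstract describes (frame-level log magnitude spectrum features, long-term context stacking, a neural network scored by per-frame posteriors) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: a single-layer logistic regression stands in for their neural network, the toy "natural" (noise) and "vocoded" (buzzy pulse train) signals are fabricated stand-ins, and the context width of ±5 frames is far smaller than the 0.3 s window the paper found important.

```python
import numpy as np

def log_magnitude_frames(signal, frame_len=512, hop=256):
    """Frame the signal and take the log magnitude spectrum of each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i*hop:i*hop+frame_len] * window
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def stack_context(feats, context=5):
    """Concatenate +/- `context` neighbouring frames to capture long-term info."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2*context + 1].ravel()
                     for i in range(len(feats))])

def train_logreg(X, y, lr=0.1, epochs=200):
    """Tiny logistic-regression stand-in for the paper's neural network."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))
        g = p - y                       # gradient of the cross-entropy loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def utterance_score(signal, w, b, context=5):
    """Mean per-frame posterior of the 'synthetic' class for one utterance."""
    X = stack_context(log_magnitude_frames(signal), context)
    z = np.clip(X @ w + b, -30, 30)
    return float(np.mean(1.0 / (1.0 + np.exp(-z))))

# Toy data: white noise stands in for natural speech, a buzzy 100 Hz pulse
# train (crudely vocoder-like) for synthetic speech. Purely illustrative.
rng = np.random.default_rng(0)
def buzzy(n=8000):
    t = np.arange(n)
    return np.sign(np.sin(2 * np.pi * 100 * t / 8000)) + 0.1 * rng.standard_normal(n)

natural = [rng.standard_normal(8000) for _ in range(4)]
synthetic = [buzzy() for _ in range(4)]

feats = lambda sigs: np.vstack([stack_context(log_magnitude_frames(s)) for s in sigs])
Xn, Xs = feats(natural[:3]), feats(synthetic[:3])     # 3 utterances per class
X = np.vstack([Xn, Xs])
y = np.concatenate([np.zeros(len(Xn)), np.ones(len(Xs))])

w, b = train_logreg(X, y)
score_nat = utterance_score(natural[3], w, b)    # held-out natural utterance
score_syn = utterance_score(synthetic[3], w, b)  # held-out synthetic utterance
```

On the held-out utterances, the posterior for the synthetic signal should exceed that of the natural one; in the paper this frame-posterior scoring, applied with real phase-based features and fused with the log magnitude spectrum, is what yields the reported EER.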
Keywords :
error statistics; feature extraction; neural nets; signal classification; speaker recognition; spectral analysis; speech coding; speech synthesis; vocoders; 2-class classification problem; CMU-ARCTIC database; EER; SV systems; TTS techniques; VC techniques; attacker; equal error rate; group delay; instantaneous frequency derivative; log magnitude spectrum; long term magnitude; low dimensional MFCC features; natural speech; neural network classifier; phase information; phase-based features; posterior scores; speaker verification; speaker's voice; spectral information; speech features; speech signals; synthetic speech detection; text-to-speech; vocoders; voice conversion; Context; Feature extraction; Mel frequency cepstral coefficient; Natural languages; Speech; Speech processing; Spoofing attack; instantaneous frequency; voice conversion;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP)
Conference_Location :
Chengdu
Type :
conf
DOI :
10.1109/ChinaSIP.2015.7230476
Filename :
7230476