Title : 
Utilizing social media data through similarity-based text normalization for LVCSR language modeling
         
        
            Author : 
Chotimongkol, Ananlada ; Thangthai, Kwanchiva ; Wutiwiwatchai, Chai
         
        
            Author_Institution : 
Nat. Electron. & Comput. Technol. Center, Pathum Thani, Thailand
         
        
        
        
        
            Abstract : 
In this paper, we explore the use of social media data to compensate for the lack of large prepared text corpora for LVCSR language modeling. Extensive normalization is required to handle the informal and noisy nature of social media text. We propose a similarity-based text normalization approach in which similarity in terms of spelling, pronunciation, and context is considered. Similarity between a source (nonstandard) word and a target (normalized) word is measured by edit distance and Kullback-Leibler distance. The proposed normalization method can handle homophones, spelling errors, and insertions (repeated characters), which occur quite often in Twitter's texts. We then trained n-gram language models with the normalized texts and achieved up to 60% relative improvement in perplexity and 9% in WER on a mobile speech-to-speech translation task. Because it is unsupervised, the proposed approach is applicable to other types of social media text.
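The two distance measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the spelling, pronunciation, and context similarities are computed or combined, so the function names (`edit_distance`, `squeeze_repeats`, `context_distribution`, `kl_divergence`), the one-word context window, the add-one smoothing, and the English toy data are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def squeeze_repeats(word, max_run=2):
    """Shorten any run of one character longer than max_run (assumed heuristic)."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)


def context_distribution(word, sentences, vocab):
    """Add-one smoothed distribution over the immediate neighbors of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1
    total = sum(counts.values()) + len(vocab)
    return {v: (counts[v] + 1) / total for v in vocab}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


# Toy corpus (hypothetical data, English used only for readability).
sentences = [
    ["this", "is", "soooo", "good"],
    ["this", "is", "so", "good"],
    ["that", "was", "so", "bad"],
]
vocab = sorted({tok for s in sentences for tok in s})

spelling_dist = edit_distance(squeeze_repeats("soooo"), "so")
context_dist = kl_divergence(
    context_distribution("soooo", sentences, vocab),
    context_distribution("so", sentences, vocab),
)
```

Here a small `spelling_dist` together with a small `context_dist` would suggest that "soooo" is a candidate for normalization to "so". A pronunciation-based comparison could reuse `edit_distance` on phoneme strings, given a grapheme-to-phoneme converter, which is not shown.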
         
        
            Keywords : 
social networking (online); speech recognition; text analysis; vocabulary; Kullback-Leibler distance; LVCSR language modeling; Twitter text; edit distance; homophonic case; mobile speech-to-speech translation task; n-gram language model; nonstandard word; normalization method; normalized text; normalized word; pronunciation; repeated character; similarity-based text normalization; social media data; spelling error case; text corpora; Accuracy; Context; Data models; Media; Mobile communication; Speech; Twitter; LVCSR; language modeling; social media; text normalization
         
        
        
        
            Conference_Title : 
2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)
         
        
        
            DOI : 
10.1109/ICSDA.2014.7051432