DocumentCode :
1904995
Title :
A Rule-Based Model for Normalization of SMS Text
Author :
Khan, O.A. ; Karim, Asad
Author_Institution :
Dept. of Comput. Sci., SBASSE Lahore Univ. of Manage. Sci. (LUMS), Lahore, Pakistan
Volume :
1
fYear :
2012
fDate :
7-9 Nov. 2012
Firstpage :
634
Lastpage :
641
Abstract :
SMS messages are short-length text documents written in a colloquial style. SMS text processing is challenging because of its low signal-to-noise ratio and multi-varied text composition in terms of language, vocabulary, style and quality. These challenges can be overcome by robust text normalization, which is a necessary step before any technique can be applied to and evaluated on such data. In this paper, we present a rule-based model for multi-lingual SMS text normalization, focusing on messages written in Romanized Urdu and English. Urdu, in contrast to English, is a morphologically rich language (MRL), i.e. it produces a very large number of word forms for a given root form, while Romanized Urdu is a way of writing Urdu in Latin script that does not follow standard rules for systematic communication. Hence, normalization or standardization of multi-lingual SMS text poses challenges associated with SMS text, multi-lingualism, MRLs and Latin script. Our SMS standardizer is based on a tuned set of rules that range over various domains of natural language processing and that tackle the challenges mentioned above effectively. We then apply the standardizer to keyword extraction from SMS messages, where it significantly improves performance, by up to 23% in F-measure.
Keywords :
natural language processing; text analysis; English; F-measure; Latin script; MRL; Romanized Urdu; SMS standardizer; SMS text processing; colloquial style documents; keyword extraction; morphologically rich language; multilingual SMS text normalization; multivaried text composition; rule-based model; short message service; short-length text documents; signal-to-noise ratio; Artificial intelligence; Bismuth; Conferences; Lead; SMS; Text Normalization
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on
Conference_Location :
Athens
ISSN :
1082-3409
Print_ISBN :
978-1-4799-0227-9
Type :
conf
DOI :
10.1109/ICTAI.2012.91
Filename :
6495103