A Rule-Based Model for Normalization of SMS Text

Author

Khan, O.A. ; Karim, Asad

Author_Institution

Dept. of Comput. Sci., SBASSE Lahore Univ. of Manage. Sci. (LUMS), Lahore, Pakistan

Volume

1

fYear

2012

fDate

7-9 Nov. 2012

Firstpage

634

Lastpage

641

Abstract

SMS messages are short text documents written in a colloquial style. Processing SMS text is challenging because of its low signal-to-noise ratio and its varied composition in terms of language, vocabulary, style and quality. These challenges can be overcome by robust text normalization, which is a necessary step before any technique can be applied to and evaluated on such data. In this paper, we present a rule-based model for multilingual SMS text normalization, focusing on messages written in Romanized Urdu and English. Urdu, in contrast to English, is a morphologically rich language (MRL), i.e. it produces a very large number of word forms for a given root form, while Romanized Urdu is a way of writing Urdu in Latin script that does not follow standard rules for systematic communication. Hence, normalization (or standardization) of multilingual SMS text poses the combined challenges of SMS text, multilingualism, MRLs and Latin script. Our SMS standardizer is based on a tuned set of rules that span various domains of natural language processing and tackle the challenges mentioned above effectively. We then apply the standardizer to keyword extraction from SMS messages, where it improves performance by up to 23% in F-measure.
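
As a rough illustration of what rule-based SMS normalization can look like, the following minimal Python sketch applies three simple rules in sequence. It is not the authors' rule set, which this record does not describe: the function name normalize, the rules themselves, and every entry in the REPLACEMENTS table (including the Romanized Urdu spelling variants) are assumptions chosen only for illustration.

```python
import re

# Hypothetical lookup table, not taken from the paper: a few English SMS
# abbreviations and a few assumed Romanized Urdu spelling variants, each
# mapped to a single standard form.
REPLACEMENTS = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "plz": "please",
    "kia": "kya",   # assumed variant spelling
    "hy": "hai",    # assumed variant spelling
}


def normalize(message: str) -> str:
    """Apply a few illustrative normalization rules to one SMS message."""
    text = message.lower()
    # Rule 1: shorten any run of three or more identical characters to two
    # ("sooon" -> "soon").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Rule 2: keep only word-like tokens, which drops punctuation.
    tokens = re.findall(r"[a-z0-9']+", text)
    # Rule 3: replace known abbreviations and variants via table lookup.
    tokens = [REPLACEMENTS.get(token, token) for token in tokens]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("Plz come sooon, u r gr8!!!"))
    print(normalize("kia haal hy"))
```

A single lookup table keeps the sketch short; a system handling a morphologically rich language would presumably need rules that operate on word forms rather than whole-token lookups.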

Keywords

natural language processing; text analysis; English; F-measure; Latin script; MRL; Romanized Urdu; SMS standardizer; SMS text processing; colloquial style documents; keyword extraction; morphologically rich language; multilingual SMS text normalization; multivaried text composition; rule-based model; short message service; short-length text documents; signal-to-noise ratio; Artificial intelligence; Conferences; SMS; Text Normalization

fLanguage

English

Publisher

IEEE

Conference_Titel

Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on

Conference_Location

Athens

ISSN

1082-3409

Print_ISBN

978-1-4799-0227-9

Type

conf

DOI

10.1109/ICTAI.2012.91

Filename

6495103