  • DocumentCode
    2179331
  • Title
    Toward text message normalization: Modeling abbreviation generation
  • Author
    Pennell, Deana ; Liu, Yang
  • Author_Institution
    Comput. Sci. Dept., Univ. of Texas at Dallas, Dallas, TX, USA
  • fYear
    2011
  • fDate
    22-27 May 2011
  • Firstpage
    5364
  • Lastpage
    5367
  • Abstract
    This paper describes a text normalization system for deletion-based abbreviations in informal text. We propose using statistical classifiers to learn the probability of deleting a given character using features based on character context, position in the word and containing syllable, and function within the word. To ensure that our system is robust to different and previously unseen abbreviations for a word, we generate multiple abbreviation hypotheses for a word using the predictions from the classifiers. We then reverse the mappings to enable recovery of English words from the abbreviations. Different knowledge sources are used to disambiguate word candidates: abbreviation likelihood, length, and language model scores. Our results show that this approach is feasible and warrants further exploration in the future.
  • Keywords
    electronic messaging; probability; speech synthesis; text analysis; word processing; English word; SMS; abbreviation likelihood; character context; deletion-based abbreviation; toward text message normalization; Computational modeling; Context; Decoding; Error analysis; Hidden Markov models; Mathematical model; Twitter; abbreviation modeling; noisy text processing; text normalization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
  • Conference_Location
    Prague
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4577-0538-0
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2011.5947570
  • Filename
    5947570
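
The pipeline outlined in the abstract (per-character deletion probabilities, multiple abbreviation hypotheses per word, then a reversed mapping from abbreviations back to words) can be illustrated with a minimal sketch. This is not the authors' implementation: `deletion_prob` is a hypothetical hand-set heuristic standing in for the trained statistical classifiers, its probability values are invented for illustration, and candidates are ranked by abbreviation likelihood only, leaving out the length and language model scores the abstract mentions.

```python
from collections import defaultdict
from itertools import product

VOWELS = set("aeiou")


def deletion_prob(word, i):
    """Hypothetical stand-in for a trained classifier: P(delete word[i]).

    All values below are assumptions chosen for illustration only.
    """
    if i == 0:
        return 0.02  # assumed: the first character is rarely deleted
    if word[i] in VOWELS:
        return 0.6   # assumed: vowels are deleted often
    if word[i] == word[i - 1]:
        return 0.5   # assumed: doubled letters are often reduced
    return 0.1


def abbreviations(word, top_n=5):
    """Return the top_n (abbreviation, probability) hypotheses for a word.

    Enumerates every deletion pattern, so it is only practical for short words.
    """
    probs = [deletion_prob(word, i) for i in range(len(word))]
    scored = defaultdict(float)
    for mask in product([False, True], repeat=len(word)):
        if not any(mask):
            continue  # nothing deleted: this is the word itself
        p = 1.0
        for delete, q in zip(mask, probs):
            p *= q if delete else 1.0 - q
        abbr = "".join(c for c, delete in zip(word, mask) if not delete)
        if abbr:
            # Different deletion patterns can yield the same string.
            scored[abbr] += p
    return sorted(scored.items(), key=lambda kv: -kv[1])[:top_n]


def build_reverse_map(vocab, top_n=5):
    """Map each abbreviation to its candidate words, most likely first."""
    reverse = defaultdict(list)
    for word in vocab:
        for abbr, p in abbreviations(word, top_n):
            reverse[abbr].append((word, p))
    for candidates in reverse.values():
        candidates.sort(key=lambda wp: -wp[1])
    return reverse


reverse_map = build_reverse_map(["text", "tomorrow", "test"])
print(reverse_map["txt"])
```

Looking up an observed abbreviation in `reverse_map` returns the candidate words ordered by the assumed abbreviation likelihood. A fuller system would combine that score with the other knowledge sources named in the abstract to choose among the candidates.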