• DocumentCode
    1057791
  • Title
    Automatic Word Spacing Using Probabilistic Models Based on Character n-grams
  • Author
    Lee, Do-Gil ; Rim, Hae-Chang ; Yook, Dongsuk
  • Author_Institution
    Korea Univ., Seoul
  • Volume
    22
  • Issue
    1
  • fYear
    2007
  • Firstpage
    28
  • Lastpage
    35
  • Abstract
    On the Internet, information is largely in text form, which often includes such errors as spelling mistakes. These errors complicate natural language processing because most NLP applications aren't robust and assume that the input data is noise free. Preprocessing is necessary to deal with these errors and meet the growing need for automatic text processing. One kind of such preprocessing is automatic word spacing. This process decides correct boundaries between words in a sentence containing spacing errors, which are a type of spelling error. Except for some Asian languages such as Chinese and Japanese, most languages have explicit word spacing. In these languages, word spacing is crucial to increase readability and to accurately communicate a text's meaning. Automatic word spacing plays an important role not only as a spell-checker module but also as a preprocessor for a morphological analyzer, which is a fundamental tool for NLP applications. Furthermore, automatic word spacing can serve as a postprocessor for optical-character-recognition systems and speech recognition systems.
  • Keywords
    natural language processing; optical character recognition; speech recognition; spelling aids; text analysis; automatic text processing; automatic word spacing; optical character recognition system; speech recognition system; spell-checker module; spelling error; spelling mistakes; Error correction; Hidden Markov models; Internet; Natural languages; Noise robustness; Probability; Tagging; Text processing; machine learning; n-gram; probabilistic models; word spacing
  • fLanguage
    English
  • Journal_Title
    Intelligent Systems, IEEE
  • Publisher
    IEEE
  • ISSN
    1541-1672
  • Type
    jour
  • DOI
    10.1109/MIS.2007.4
  • Filename
    4078953