• DocumentCode
    1114985
  • Title

    n-Gram Statistics for Natural Language Understanding and Text Processing

  • Author

    Suen, Ching Y.

  • Author_Institution
    SENIOR MEMBER, IEEE, Department of Computer Science, Concordia University, Montreal, P.Q., Canada; Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139.
  • Issue
    2
  • fYear
    1979
  • fDate
    4/1/1979
  • Firstpage
    164
  • Lastpage
    172
  • Abstract
    n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
  • Keywords
    Application software; Error correction; Frequency; Humans; Natural languages; Optical character recognition software; Statistical distributions; Statistics; Text processing; Vocabulary; Character recognition; context; language understanding; n-gram statistics; positional distributions of letters; word length analysis
  • fLanguage
    English
  • Journal_Title
    Pattern Analysis and Machine Intelligence, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0162-8828
  • Type

    jour

  • DOI
    10.1109/TPAMI.1979.4766902
  • Filename
    4766902