• DocumentCode
    672865
  • Title
    Determination of linguistic differences and statistical analysis of large corpora of Indian languages
  • Author
    Bansal, Sunny; Mahajan, Monika; Agrawal, S.S.
  • Author_Institution
    KIIT Coll. of Eng., Gurgaon, India
  • fYear
    2013
  • fDate
    25-27 Nov. 2013
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    This paper presents a statistical analysis of large corpora for three Indian languages: Hindi, Punjabi, and Nepali. The main objective of the study is to analyze the statistical features of these three languages that may help distinguish among them. Detailed statistical analyses were carried out to compute entropy, perplexity, word length, coverage, and vocabulary growth rate, among other measures. The most frequently occurring words were extracted from the text corpus of each language. Based on these statistical features, a comparative analysis was performed to find the similarities and differences among the languages.
  • Keywords
    computational linguistics; entropy; natural languages; statistical analysis; text analysis; Hindi; Indian languages; Nepali; Punjabi; coverage analysis; large corpora; linguistic difference determination; perplexity; statistical feature analysis; text corpus; vocabulary growth rate; word length; Pragmatics; Redundancy; Speech; Vocabulary; Statistical Analysis of Corpora; Zipf's law
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)
  • Conference_Location
    Gurgaon
  • Type
    conf
  • DOI
    10.1109/ICSDA.2013.6709890
  • Filename
    6709890
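
The abstract names several corpus statistics (entropy, perplexity, word length, coverage, vocabulary growth rate) without defining them. The following is a minimal sketch of how such quantities are commonly computed from a tokenized corpus, assuming a unigram (word-frequency) model and perplexity defined as 2 raised to the entropy. It is not taken from the paper; the function names `corpus_statistics` and `vocabulary_growth` and the parameters `top_k` and `step` are hypothetical, and the authors' exact definitions may differ.

```python
import math
from collections import Counter


def corpus_statistics(tokens, top_k=10):
    """Unigram statistics for a list of word tokens (illustrative assumptions only)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    total = sum(counts.values())

    # Shannon entropy of the unigram distribution, in bits per word.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Unigram perplexity, assumed here to be 2 ** entropy.
    perplexity = 2 ** entropy

    # Mean word length in characters, weighted by token frequency.
    mean_word_length = sum(len(w) * c for w, c in counts.items()) / total

    # Coverage: fraction of all tokens accounted for by the top_k most frequent words.
    most_frequent = counts.most_common(top_k)
    coverage = sum(c for _, c in most_frequent) / total

    return {
        "tokens": total,
        "types": len(counts),
        "entropy": entropy,
        "perplexity": perplexity,
        "mean_word_length": mean_word_length,
        "coverage": coverage,
        "most_frequent": most_frequent,
    }


def vocabulary_growth(tokens, step=1000):
    """Return (tokens_seen, distinct_words_seen) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve
```

Running both functions on the token lists of each language would give directly comparable numbers. Tokenization of Devanagari and Gurmukhi text is left out of the sketch and would need to be handled before calling them.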