Determination of linguistic differences and statistical analysis of large corpora of Indian languages

Author

Bansal, Sunny ; Mahajan, Monika ; Agrawal, S.S.

Author_Institution

KIIT Coll. of Eng., Gurgaon, India

fYear

2013

fDate

25-27 Nov. 2013

Firstpage

1

Lastpage

5

Abstract

This paper presents a statistical analysis of large text corpora for three Indian languages: Hindi, Punjabi and Nepali. The main objective is to identify statistical features that may help distinguish these languages from one another. Detailed analyses compute entropy, perplexity, word length, coverage and vocabulary growth rate, among other measures. The most frequently occurring words are extracted from the text corpus of each language. Based on these statistical features, a comparative analysis identifies similarities and differences among the three languages.
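
Illustrative sketch (not from the paper): the record does not say how the authors computed these measures, so the Python below only shows one common way such corpus statistics could be obtained from a list of word tokens. The function names, the unigram (single-word) model, the whitespace tokenization in the usage note, and the top_k and step parameters are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_stats(tokens, top_k=1):
    """Entropy (bits per word), perplexity and top_k coverage of a token list.

    Assumes a simple unigram model; the paper may use a different model.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # Shannon entropy of the relative word frequencies, in bits per word.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # For a unigram model scored on its own distribution, perplexity = 2 ** entropy.
    perplexity = 2 ** entropy
    # Coverage: share of running text accounted for by the top_k most frequent words.
    coverage = sum(c for _, c in counts.most_common(top_k)) / total
    return entropy, perplexity, coverage


def vocabulary_growth(tokens, step):
    """(tokens read, distinct words seen) sampled after every `step` tokens."""
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth


def mean_word_length(tokens):
    """Average word length in characters (Unicode code points)."""
    return sum(len(t) for t in tokens) / len(tokens)
```

A corpus would typically be passed in as tokens = text.split(), one call per language, so the resulting figures can be compared side by side. For Devanagari or Gurmukhi text, len() counts code points rather than written syllables, so a word-length comparison may need a script-aware unit instead.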

Keywords

computational linguistics; entropy; natural languages; statistical analysis; text analysis; Hindi; Indian languages; Nepali; Punjabi; coverage analysis; entropy; large corpora; linguistic difference determination; perplexity; statistical feature analysis; text corpus; vocabulary growth rate; word length; Computational linguistics; Entropy; Pragmatics; Redundancy; Speech; Statistical analysis; Vocabulary; Statistical Analysis of Corpora; Vocabulary growth rate; Zipf's law;

fLanguage

English

Publisher

IEEE

Conference_Titel

Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference

Conference_Location

Gurgaon

Type

conf

DOI

10.1109/ICSDA.2013.6709890

Filename

6709890