DocumentCode
672865
Title
Determination of linguistic differences and statistical analysis of large corpora of Indian languages
Author
Bansal, Sunny ; Mahajan, Monika ; Agrawal, S.S.
Author_Institution
KIIT Coll. of Eng., Gurgaon, India
fYear
2013
fDate
25-27 Nov. 2013
Firstpage
1
Lastpage
5
Abstract
This paper presents a statistical analysis of large corpora for three Indian languages: Hindi, Punjabi and Nepali. The main objective of the study is to analyze statistical features of these languages that may help distinguish them from one another. Detailed statistical analyses have been carried out to compute entropy, perplexity, word length, coverage, vocabulary growth rate and related measures. The most frequently occurring words have been extracted from the text corpus of each of the three languages. Based on these statistical features, a comparative analysis has been done to find the similarities and differences among the languages.
Keywords
computational linguistics; entropy; natural languages; statistical analysis; text analysis; Hindi; Indian languages; Nepali; Punjabi; coverage analysis; entropy; large corpora; linguistic difference determination; perplexity; statistical feature analysis; text corpus; vocabulary growth rate; word length; Computational linguistics; Entropy; Pragmatics; Redundancy; Speech; Statistical analysis; Vocabulary; Statistical Analysis of Corpora; Vocabulary growth rate; Zipf's law
fLanguage
English
Publisher
IEEE
Conference_Titel
2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)
Conference_Location
Gurgaon
Type
conf
DOI
10.1109/ICSDA.2013.6709890
Filename
6709890