DocumentCode :
672865
Title :
Determination of linguistic differences and statistical analysis of large corpora of Indian languages
Author :
Bansal, Sunny ; Mahajan, Monika ; Agrawal, S.S.
Author_Institution :
KIIT Coll. of Eng., Gurgaon, India
fYear :
2013
fDate :
25-27 Nov. 2013
Firstpage :
1
Lastpage :
5
Abstract :
This paper presents a statistical analysis of large corpora for three Indian languages: Hindi, Punjabi and Nepali. The main objective of the study is to analyze statistical features that may help distinguish among these languages. Detailed statistical analyses were carried out to compute entropy, perplexity, word length, coverage and vocabulary growth rate, among other measures. The most frequently occurring words were extracted from the text corpus of each language. Based on these statistical features, a comparative analysis was performed to identify similarities and differences among the languages.
Keywords :
computational linguistics; entropy; natural languages; statistical analysis; text analysis; Hindi; Indian languages; Nepali; Punjabi; coverage analysis; entropy; large corpora; linguistic difference determination; perplexity; statistical feature analysis; text corpus; vocabulary growth rate; word length; Computational linguistics; Entropy; Pragmatics; Redundancy; Speech; Statistical analysis; Vocabulary; Statistical Analysis of Corpora; Vocabulary growth rate; Zipf's law;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference
Conference_Location :
Gurgaon
Type :
conf
DOI :
10.1109/ICSDA.2013.6709890
Filename :
6709890