Text studies towards multi-lingual content mining for web communication

Author

Prakash, Kolla Bhanu ; Rangaswamy, M.A.D. ; Raman, Arun Raja

Author_Institution

Sathyabama Univ., Chennai, India

fYear

2010

fDate

17-19 Dec. 2010

Firstpage

28

Lastpage

31

Abstract

Communication through the web is becoming increasingly popular thanks to wireless and cellular networks. As this awareness spreads across different countries, significant complexities arise in the languages and means of communication used when extracting information from the web. This is particularly true in India, where there are more than fifteen officially recognized languages, with still more variation in local dialects. An example is Tamilnadu, where Tamizh, the native language with its own variations such as the Chennai, Madurai and Coimbatore dialects, is combined easily and effectively with Telugu, Kannada and Malayalam from nearby states, and with English and Hindi from global and national perspectives. A web document here could therefore be in any one of these languages, or in a mixture of words from different languages used to avoid translation; for example, the English word 'computer' has no translation in Tamizh. This varied usage raises several issues for language protagonists and communication engineers, and the resulting complexity of web documents makes conventional data mining approaches difficult to apply. The present study focuses on this problem, beginning with variations in text and moving on to words and documents. Typical characters with similar usage, such as 'a' in English and its counterparts in Tamizh and Telugu, are taken and their pixel-maps are compared for similarities and contrasts. This is later extended to more complex characters, such as a single Telugu character whose English equivalent is 'kO', which makes representation difficult. At the level of words, complexity increases further: 'temple' in English may appear in Telugu script or as 'mandiram' written in English letters. Similarities in pixel-maps are examined and their characteristics are expressed as matrices, so that when such words or letters are extracted from a web document, content mining can be put in a probabilistic format with predictions based on correlations. Typical histograms highlighting these aspects are presented, and an experiment with a document page dealing with magnetism is then used as a model for predicting content.
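
The pixel-map comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 5x5 binary glyphs, the function names, and the choice of Pearson correlation as the similarity measure are all assumptions made for illustration, since the abstract does not specify the pixel-map size or the exact correlation used.

```python
from math import sqrt

# Hypothetical 5x5 binary pixel-maps (1 = ink, 0 = background).
# Real character images would be larger and taken from rendered text.
GLYPH_A = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
GLYPH_B = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
]


def flatten(pixel_map):
    """Turn a 2-D pixel-map into a flat list of pixel values."""
    return [px for row in pixel_map for px in row]


def correlation(map_a, map_b):
    """Pearson correlation between two equally sized pixel-maps.

    Assumed similarity measure; returns 0.0 if either map is uniform.
    """
    a, b = flatten(map_a), flatten(map_b)
    if len(a) != len(b):
        raise ValueError("pixel maps must have the same dimensions")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def row_histogram(pixel_map):
    """Count ink pixels in each row, a simple histogram feature."""
    return [sum(row) for row in pixel_map]


if __name__ == "__main__":
    print("row histogram A:", row_histogram(GLYPH_A))
    print("row histogram B:", row_histogram(GLYPH_B))
    print("correlation A vs B:", round(correlation(GLYPH_A, GLYPH_B), 3))
```

A correlation near 1 would suggest that two glyphs are visually similar, and such scores could feed the kind of probabilistic prediction the abstract mentions; how the original study combines them is not stated here.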

Keywords

Internet; data mining; natural language processing; text analysis; Coimbatore dialects; Madurai dialects; Tamilnadu; Tamizh; cellular networks; communication engineers; information extraction; language protagonists; text studies towards multilingual content mining; web communication; wireless networks; Complexity theory; Computers; Feature extraction; Multimedia databases; Pixel; Web pages

fLanguage

English

Publisher

IEEE

Conference_Titel

Trendz in Information Sciences & Computing (TISC), 2010

Conference_Location

Chennai

Print_ISBN

978-1-4244-9007-3

Type

conf

DOI

10.1109/TISC.2010.5714601

Filename

5714601