Regional Information Center for Science and Technology - Word statistics of Turkish language on a large scale text corpus

DocumentCode :

2816119

Title :

Word statistics of Turkish language on a large scale text corpus - TurCo

Author :

Dalkiliç, Gökhan ; Çebi, Yalçin

Author_Institution :

Sch. of Comput. Sci., Central Florida Univ., Orlando, FL, USA

Volume :

fYear :

2004

fDate :

5-7 April 2004

Firstpage :

319

Abstract :

Determining the statistical properties of a natural language is one of the most important parts of language analysis. The number of different words (NODW) and the different word usage ratio (DWUR) are general characteristics of a corpus. These values are described and calculated for the Turkish corpus (TurCo). Word n-grams are also calculated for Turkish; this was done for English years ago but could not be done for Turkish because no large-scale corpus was available. The n-gram results were compared with those of the Brown corpus (a well-known corpus for English), and the similarity between TurCo and the Brown corpus was examined.
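
The quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not define DWUR, so the code assumes it is the number of different words divided by the total number of words; the paper's own definition may differ. The function names (tokenize, corpus_stats, word_ngrams), the regex tokenizer, and the sample sentence are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase, then take runs of word characters.
    # Note: str.lower() does not apply Turkish casing rules (I -> ı),
    # so a real Turkish pipeline would need locale-aware lowercasing.
    return re.findall(r"\w+", text.lower())


def corpus_stats(tokens):
    # NODW: number of different words (distinct tokens).
    nodw = len(set(tokens))
    # DWUR (assumed definition): different words / total words.
    dwur = nodw / len(tokens) if tokens else 0.0
    return nodw, dwur


def word_ngrams(tokens, n):
    # Count every run of n consecutive words.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Tiny illustrative sample, not data from TurCo.
tokens = tokenize("bir elma bir armut bir elma")
nodw, dwur = corpus_stats(tokens)
bigrams = word_ngrams(tokens, 2)
print(nodw, dwur, bigrams.most_common(1))
```

On a real corpus the same counts would be accumulated file by file instead of from a single string; the comparison with the Brown corpus mentioned in the abstract would then run the same functions on both corpora.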

Keywords :

computational linguistics; natural languages; statistical analysis; text analysis; word processing; Brown corpus; Turkish corpus; different word usage ratio; natural language analysis; statistical properties; word n-grams; word statistics; Computer science; Large-scale systems; Natural languages; Optical character recognition software; Pattern analysis; Pattern matching; Probability; Speech analysis; Speech recognition; Statistics;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on

Print_ISBN :

0-7695-2108-8

Type :

conf

DOI :

10.1109/ITCC.2004.1286654

Filename :

1286654

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2816119