Author/Authors :
Özel, Selma Ayşe Çukurova University - Department of Computer Engineering, Turkey , Bektaş, Yasin Çukurova University - Department of Electrical and Electronics Engineering, Turkey , Yilmazer, Hakan Çukurova University - Department of Computer Engineering, Turkey
Abstract :
Formulaic sequences are the most frequently occurring forms in a language. Identifying them is useful in a wide range of areas, including linguistics, second language learning, and natural language processing. The preferred method for identifying formulaic sequences in a language is to use a corpus, which may be built from written texts or tape-recorded conversations in that language, and to count the frequencies of sequences in the corpus. The most frequently occurring sequences are then examined to find formulas. Numerous studies have identified formulas for several languages, such as English. Only a few studies address formulaicity in Turkish, and most of them focus on identifying formulas in the form of multi-word units. Turkish, however, is an agglutinating language with a rich and complex morphology; therefore, formulaic sequences in affixation should also be discovered. Only very limited work on formulaicity in Turkish affixation exists in the literature. In this study, we try to discover formulaic sequences in Turkish affixation by counting frequent suffix n-grams in written and spoken Turkish using the Turkish National Corpus, which is a balanced, large-scale, general-purpose corpus of contemporary Turkish. We list the most frequent suffix combinations not only for verbs but for all lexical categories (noun, adjective, verb, and adverb) in both the written and spoken parts of the Turkish National Corpus, and we discuss similarities and differences in affixation between written and spoken Turkish. We observe that shorter suffix sequences are preferred in spoken Turkish compared with written Turkish, and that as the length of the suffix n-grams increases, written and spoken Turkish use different formulaic sequences.
Keywords :
Frequent suffix n-grams , written Turkish , spoken Turkish , Turkish National Corpus
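
The counting step described in the abstract (tallying frequent suffix n-grams over a corpus) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each word has already been morphologically analyzed into an ordered list of suffix labels, and the function names, the labels (PL, POSS1SG, LOC, PAST, 1SG), and the toy data are all hypothetical. Nothing here reads the Turkish National Corpus.

```python
from collections import Counter


def suffix_ngrams(suffixes, n):
    """Return all contiguous n-grams of one word's suffix sequence."""
    return [tuple(suffixes[i:i + n]) for i in range(len(suffixes) - n + 1)]


def count_suffix_ngrams(words, n):
    """Count suffix n-grams over words given as lists of suffix labels."""
    counts = Counter()
    for suffixes in words:
        counts.update(suffix_ngrams(suffixes, n))
    return counts


# Hypothetical toy data: each word is represented only by its suffix labels.
toy_words = [
    ["PL", "POSS1SG", "LOC"],  # ev-ler-im-de
    ["PL", "LOC"],             # ev-ler-de
    ["PL", "POSS1SG"],         # ev-ler-im
    ["PAST", "1SG"],           # gel-di-m
]

bigram_counts = count_suffix_ngrams(toy_words, 2)
trigram_counts = count_suffix_ngrams(toy_words, 3)

# Rank bigrams by frequency, most frequent first.
for ngram, freq in bigram_counts.most_common():
    print("+".join(ngram), freq)
```

Running the same functions separately on written and spoken word lists, for increasing values of n, would give ranked lists that can be compared in the way the abstract describes.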