Title :
Turkish labeled text corpus
Author :
Özturk, Seçil ; Sankur, B. ; Gungör, Tunga ; Yilmaz, Mustafa Berkay ; Köroğlu, Bilge ; Ağin, Onur ; İşbilen, Mustafa ; Ulaş, Çağdaş ; Ahat, Mehmet
Author_Institution :
Elektr. Elektron., Muhendisligi Bolumleri, Bogazici Univ., Istanbul, Turkey
Abstract :
A labeled text corpus consisting of the titles, abstracts, and keywords of Turkish papers has been collected. The corpus covers 35 disciplines, with 200 documents per discipline. This study presents the collection and content of the corpus. The classification performance of Term Frequency - Inverse Document Frequency (TF-IDF) features and of Latent Dirichlet Allocation (LDA) topic probabilities is compared on the corpus. The corpus is shared as open source so that it can be used for natural language processing applications with academic purposes.
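The TF-IDF features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which weighting variant was used, so the raw-count term frequency times log(N / df) below is an assumption, and the toy documents and the `tfidf` helper are hypothetical and not taken from the corpus.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: raw term count times log(N / df).
    Other TF-IDF variants (smoothing, normalization) exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical toy documents (not from the corpus), already tokenized.
docs = [
    ["sinyal", "işleme", "filtre"],
    ["sinyal", "gürültü", "filtre", "filtre"],
    ["metin", "sınıflandırma", "derlem"],
]
vectors = tfidf(docs)
```

Terms that occur in few documents receive larger weights, which is what makes these vectors usable as input to a classifier. The LDA topic probabilities that the abstract compares against would replace each sparse term vector with a short vector of per-topic probabilities.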
Keywords :
natural language processing; pattern classification; probability; text analysis; LDA features; TF-IDF; Turkish labeled text corpus; Turkish paper abstracts; Turkish paper keywords; Turkish paper titles; academic purposes; classification performance; latent Dirichlet allocation features; natural language processing applications; term frequency-inverse document frequency; text corpus collection; text corpus content; topic probabilities; Abstracts; Conferences; Natural language processing; Resource management; Signal processing; Support vector machines; XML; Classification; Corpus; Inverse Document Frequency; Latent Dirichlet Allocation; NLP; Natural Language Processing; Paper; TF-IDF; Term Frequency; Turkish;
Conference_Title :
Signal Processing and Communications Applications Conference (SIU), 2014 22nd
Conference_Location :
Trabzon
DOI :
10.1109/SIU.2014.6830499