Regional Information Center for Science and Technology - Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora

DocumentCode :

3739798

Title :

Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora

Author :

Aleksandr Drozd;Anna Gladkova;Satoshi Matsuoka

Author_Institution :

Global Sci. Inf. &

Year :

2015

Firstpage :

Lastpage :

Abstract :

This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and with such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce based approach; this kernel achieves an order of magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.

Keywords :

"Context","Pragmatics","Semantics","Syntactics","Internet","Electronic mail","Data models"

Publisher :

IEEE

Conference_Title :

Data Science and Data Intensive Systems (DSDIS), 2015 IEEE International Conference on

Type :

conf

DOI :

10.1109/DSDIS.2015.30

Filename :

7396482

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3739798