Regional Information Center for Science and Technology - Teraman: A Tool for N-gram Extraction from Large Datasets

DocumentCode :

1816046

Title :

Teraman: A Tool for N-gram Extraction from Large Datasets

Author :

Ceska, Zdenek ; Hanak, Ivo ; Tesar, Roman

Author_Institution :

West Bohemia Univ., Pilsen

fYear :

2007

fDate :

6-8 Sept. 2007

Firstpage :

209

Lastpage :

216

Abstract :

In natural language processing (NLP), text documents are mainly represented by single words. Recent studies have shown that this approach can often be improved by employing other, more sophisticated features. Among them, N-grams in particular have been used successfully for this purpose, and many algorithms and procedures for their extraction have been proposed. However, they are usually not primarily intended for processing large data, which has become a critical task. In this paper we present an algorithm for N-gram extraction from huge datasets. The experiments indicate that our approach achieves outstanding results compared with other available solutions in terms of speed and amount of processed data.

Keywords :

data mining; natural language processing; text analysis; very large databases; N-gram extraction; Teraman tool; large datasets; text documents; Biological system modeling; Concurrent computing; Data processing; Frequency; Genetics; Internet; Text categorization; Text recognition

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Intelligent Computer Communication and Processing, 2007 IEEE International Conference on

Conference_Location :

Cluj-Napoca

Print_ISBN :

978-1-4244-1491-8

Type :

conf

DOI :

10.1109/ICCP.2007.4352162

Filename :

4352162

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1816046