مرکز منطقه ای اطلاع رساني علوم و فناوري - Index-based n-gram extraction from large document collections

DocumentCode :

3644689

Title :

Index-based n-gram extraction from large document collections

Author :

Michal Krátký;Radim Bača;David Bednář;Jiří Walder;Jiří Dvorský;Peter Chovanec

Author_Institution :

Department of Computer Science, VŠ

fYear :

2011

Firstpage :

Lastpage :

Abstract :

N-grams are applied in some applications searching in text documents, especially in cases when one must work with phrases, e.g. in plagiarism detection. N-gram is a sequence of n terms (or generally tokens) from a document. We get a set of n-grams by moving a floating window from the begin to the end of the document. During the extraction we must remove duplicate n-grams and we must store additional values to each n-gram type, e.g. n-gram type frequency for each document and so on, it depends on a query model used. Previous works utilize a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of the same n-grams resulting in high time and space overhead. Moreover, these techniques are often main-memory only, it means they must be executed for small or middle size collections. In this paper, we show an index-based method to the n-gram extraction for large collections. This method utilizes common data structures like B⁺-tree and Hash table. We show the scalability of our method by presenting experiments with the gigabytes collection.

Keywords :

"Indexes","Arrays","Sorting","Complexity theory","Servers","Scalability"

Publisher :

ieee

Conference_Titel :

Digital Information Management (ICDIM), 2011 Sixth International Conference on

ISSN :

Pending

Print_ISBN :

978-1-4577-1538-9

Type :

conf

DOI :

10.1109/ICDIM.2011.6093324

Filename :

6093324

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3644689