Title :
Index-based n-gram extraction from large document collections
Author :
Michal Krátký;Radim Bača;David Bednář;Jiří Walder;Jiří Dvorský;Peter Chovanec
Author_Institution :
Department of Computer Science, VŠ
Abstract :
N-grams are applied in some applications searching in text documents, especially in cases when one must work with phrases, e.g. in plagiarism detection. N-gram is a sequence of n terms (or generally tokens) from a document. We get a set of n-grams by moving a floating window from the begin to the end of the document. During the extraction we must remove duplicate n-grams and we must store additional values to each n-gram type, e.g. n-gram type frequency for each document and so on, it depends on a query model used. Previous works utilize a sorting algorithm to compute the n-gram frequency. These approaches must handle a high number of the same n-grams resulting in high time and space overhead. Moreover, these techniques are often main-memory only, it means they must be executed for small or middle size collections. In this paper, we show an index-based method to the n-gram extraction for large collections. This method utilizes common data structures like B+-tree and Hash table. We show the scalability of our method by presenting experiments with the gigabytes collection.
Keywords :
"Indexes","Arrays","Sorting","Complexity theory","Servers","Scalability"
Conference_Titel :
Digital Information Management (ICDIM), 2011 Sixth International Conference on
Print_ISBN :
978-1-4577-1538-9
DOI :
10.1109/ICDIM.2011.6093324