Title :
Managing the Google Web 1T 5-gram data set
Author :
Islam, Aminul ; Inkpen, Diana
Author_Institution :
Dept. of Comput. Sci., Univ. of Ottawa, Ottawa, ON, Canada
Abstract :
This paper describes how the Google Web 1T 5-gram data set, contributed by Google Inc., can be stored so that it can be used efficiently with respect to time. We present an efficient way of accessing all the 5-grams for a specific word of interest from the stored files. We measure the maximum access and processing efficiency achievable for any word of interest. We also compare results (access time and memory requirements) on the task of accessing all the 5-grams for a list of words, on both the processed and the original organization of the data set.
Keywords :
Internet; data handling; natural language processing; Google Incorporated; Google Web 1T 5-gram data set; Computer science; Data mining; Frequency; Research and development; Speech recognition; Web pages; 5-grams; Google Web 1T; n-gram
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
DOI :
10.1109/NLPKE.2009.5313839