DocumentCode :
560154
Title :
A distributed look-up architecture for text mining applications using MapReduce
Author :
Balkir, Atilla Soner ; Foster, Ian ; Rzhetsky, Andrey
Author_Institution :
Dept. of Comput. Sci., Univ. of Chicago, Chicago, IL, USA
fYear :
2011
fDate :
12-18 Nov. 2011
Firstpage :
1
Lastpage :
11
Abstract :
We study text analysis algorithms that use global optimization methods to compute local characteristics that are consistent with properties of the entire corpus rather than computed locally based on exogenous parameters. In the iterative implementations that we consider, each step both reads and updates a database of parameter values. Motivated by a need for rapid analysis of large corpora, we have developed methods for efficient access to such databases on parallel computers. These methods combine Bloom filters, in-memory caches, and an HBase cluster to reduce communication costs greatly relative to simpler approaches that either fully distribute or fully replicate the database. We also describe how this method can be incorporated into the MapReduce programming model, and illustrate its use within phrase segmentation programs. Our design can achieve considerable run time, latency and storage space improvements relative to other methods. In one phrase segmentation application, we improve performance by a factor of six relative to an HBase-based implementation.
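The lookup path the abstract describes (Bloom filter, then in-memory cache, then the remote HBase table) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `BloomFilter` and `TieredLookup`, the filter size, and the LRU eviction policy are assumptions made for illustration, and a plain Python dict stands in for the HBase cluster. Parameter updates, which the abstract says happen at each iteration, are left out.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class TieredLookup:
    """Hypothetical three-tier read path: Bloom filter, LRU cache, remote store."""

    def __init__(self, remote_store, cache_size=1024):
        self.remote = remote_store      # assumption: a dict standing in for HBase
        self.remote_reads = 0           # counts simulated network round trips
        self.cache = OrderedDict()      # insertion order doubles as LRU order
        self.cache_size = cache_size
        self.bloom = BloomFilter()
        for key in remote_store:
            self.bloom.add(key)

    def get(self, key, default=None):
        # Tier 1: a negative Bloom filter answer is definitive, so skip the remote read.
        if key not in self.bloom:
            return default
        # Tier 2: serve recently used keys from local memory.
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # Tier 3: fall through to the remote store and remember the result.
        self.remote_reads += 1
        value = self.remote.get(key, default)
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value
```

Under these assumptions, repeated lookups of the same phrase cost one remote read, and most lookups of phrases that are not in the table cost none, which is the kind of communication saving the abstract attributes to the combined design.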
Keywords :
data mining; distributed processing; replicated databases; text analysis; Bloom filters; HBase cluster; MapReduce programming model; communication cost reduction; database replication; distributed look-up architecture; global optimization methods; in-memory caches; local characteristics computation; parallel computers; phrase segmentation programs; text analysis algorithms; text mining applications; Computational modeling; Data models; Distributed databases; Search engines; Text mining; Time frequency analysis; Distributed Storage; MapReduce; Text Mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Conference_Location :
Seattle, WA
Electronic_ISBN :
978-1-4503-0771-0
Type :
conf
Filename :
6114419