DocumentCode :
2662841
Title :
UIMA GRID: Distributed Large-scale Text Analysis
Author :
Egner, Michael Thomas ; Lorch, Markus ; Biddle, Edd
Author_Institution :
Albstadt-Sigmaringen Univ., Albstadt
fYear :
2007
fDate :
14-17 May 2007
Firstpage :
317
Lastpage :
326
Abstract :
This paper shows how loosely coupled compute resources, managed by Condor, can be combined with IBM OmniFind to implement a scalable environment for text analysis based on the Unstructured Information Management Architecture (UIMA). Text analysis can be used to extract valuable knowledge, such as entities and their relationships, from unstructured text data. When applied to large amounts of data, e.g., on the order of several million documents, the process can be too time-consuming to keep pace with business needs. This becomes a particular problem when the rule sets, dictionaries, or taxonomies used by the text analysis components are changed to extract new information for a particular business purpose. Such changes may require the entire set of documents to be re-analyzed. In the scenario motivating this work, a constantly growing set of currently 10 million documents frequently needs to be re-processed to accommodate such changes. The text analysis algorithms deployed are very complex and compute-intensive, currently requiring about 20 CPU-years for a full re-analysis. Through the distributed architecture discussed in this paper, the re-analysis can be performed in one calendar month by opportunistically leveraging compute nodes from a heterogeneous Condor pool.
Keywords :
grid computing; text analysis; IBM OmniFind; UIMA GRID; distributed architecture; distributed large-scale text analysis; heterogeneous Condor pool; information extraction; unstructured information management architecture; Calendars; Computer architecture; Data mining; Dictionaries; Environmental management; Information management; Large-scale systems; Resource management; Taxonomy; Text analysis
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing and the Grid, 2007. CCGRID 2007. Seventh IEEE International Symposium on
Conference_Location :
Rio de Janeiro
Print_ISBN :
0-7695-2833-3
Type :
conf
DOI :
10.1109/CCGRID.2007.118
Filename :
4215396