Title :
UIMA GRID: Distributed Large-scale Text Analysis
Author :
Egner, Michael Thomas ; Lorch, Markus ; Biddle, Edd
Author_Institution :
Albstadt-Sigmaringen Univ., Albstadt
Abstract :
This paper shows how loosely coupled compute resources, managed by Condor, can be leveraged together with IBM OmniFind to implement a scalable environment for text analysis based on the Unstructured Information Management Architecture (UIMA). Text analysis can be used to extract valuable knowledge, such as entities and their relationships, from unstructured text data. When applied to large amounts of data, e.g., on the order of several million documents, the process can be too time-consuming to react to business needs. This becomes a particular problem when the rule sets, dictionaries, or taxonomies used by the text analysis components are changed to extract new information for a particular business purpose. Such changes may require the entire set of documents to be reanalyzed. In the scenario motivating this work, a constantly growing set of currently 10 million documents frequently needs to be reprocessed to accommodate such changes. The text analysis algorithms deployed are very complex and compute-intensive, currently requiring about 20 CPU-years for a full re-analysis. Through the distributed architecture discussed in this paper, the re-analysis can be performed in one calendar month by opportunistically leveraging compute nodes from a heterogeneous Condor pool.
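The figures quoted in the abstract (20 CPU-years, 10 million documents, one calendar month) imply a rough scale for the pool. The following is a minimal back-of-the-envelope sketch, not taken from the paper: it assumes perfect parallelism, no scheduling or data-transfer overhead, equally fast nodes, and a 365-day year. The function names are hypothetical and serve only as illustration.

```python
# Back-of-the-envelope sizing from the figures in the abstract.
# Assumptions (not from the paper): perfect parallelism, zero overhead,
# homogeneous node speed, and a 365-day year.

SECONDS_PER_YEAR = 365 * 24 * 3600
MONTHS_PER_YEAR = 12


def average_nodes_needed(cpu_years: float, wall_clock_months: float) -> float:
    """Average number of CPUs that must be busy for the whole period."""
    return cpu_years * MONTHS_PER_YEAR / wall_clock_months


def cpu_seconds_per_document(cpu_years: float, documents: int) -> float:
    """Mean CPU time spent analyzing a single document."""
    return cpu_years * SECONDS_PER_YEAR / documents


if __name__ == "__main__":
    nodes = average_nodes_needed(cpu_years=20, wall_clock_months=1)
    per_doc = cpu_seconds_per_document(cpu_years=20, documents=10_000_000)
    print(f"average busy CPUs: {nodes:.0f}")
    print(f"CPU-seconds per document: {per_doc:.1f}")
```

Under these assumptions, the stated workload corresponds to roughly 240 CPUs kept busy for the whole month and about 63 CPU-seconds per document. Because an opportunistic Condor pool only provides nodes when they are otherwise idle, the number of machines actually enrolled would likely need to be larger.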
Keywords :
grid computing; text analysis; IBM OmniFind; UIMA GRID; distributed architecture; distributed large-scale text analysis; heterogeneous Condor pool; information extraction; unstructured information management architecture; Calendars; Computer architecture; Data mining; Dictionaries; Environmental management; Information management; Large-scale systems; Resource management; Taxonomy; Text analysis;
Conference_Title :
Cluster Computing and the Grid, 2007. CCGRID 2007. Seventh IEEE International Symposium on
Conference_Location :
Rio de Janeiro
Print_ISBN :
0-7695-2833-3
DOI :
10.1109/CCGRID.2007.118