Title :
ToMaR -- A Data Generator for Large Volumes of Content
Author :
Schmidt, R. ; Rella, Matthias ; Schlarb, Sven
Author_Institution :
AIT Austrian Inst. of Technol., Vienna, Austria
Abstract :
We present ToMaR, a scalable application that supports the efficient integration of legacy applications within a MapReduce environment. The work is motivated by scenarios for scalable content processing developed in the context of the EC project SCAPE. ToMaR specifically addresses the need to extract data sets from large volumes of binary content using existing, content-specific applications within a scalable data management environment. This paper discusses the main functionalities of ToMaR and describes how ToMaR is used as part of a typical workflow. We present a real-world scenario that uses ToMaR to characterize archived web content, and we discuss a workflow and experimental results produced using sample content from the Web Archive Austria.
Keywords :
Internet; content management; data analysis; software maintenance; EC project; MapReduce environment; SCAPE; ToMaR; Web Archive Austria; archived Web content characterization; binary content; content-specific applications; data generator; data set extraction; legacy application integration; scalable content processing; scalable data management environment; scalable preservation environment project; workflow; Containers; Context; Data mining; Libraries; Software packages; XML; data-intensive computing; mapreduce; metadata extraction
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
DOI :
10.1109/CCGrid.2014.88