DocumentCode
168804
Title
ToMaR -- A Data Generator for Large Volumes of Content
Author
Schmidt, R. ; Rella, Matthias ; Schlarb, Sven
Author_Institution
AIT Austrian Inst. of Technol., Vienna, Austria
fYear
2014
fDate
26-29 May 2014
Firstpage
937
Lastpage
942
Abstract
We present ToMaR, a scalable application that supports the efficient integration of legacy applications within a MapReduce environment. The work is motivated by scenarios for scalable content processing developed in the context of the EC project SCAPE. ToMaR specifically addresses the need for extracting data sets from large volumes of binary content based on existing, content-specific applications within a scalable data management environment. This paper discusses the main functionalities of ToMaR and describes how ToMaR is utilized as part of a typical workflow. We present a real-world scenario that makes use of ToMaR for the characterization of archived web content. A workflow and experimental results which have been produced using sample content from the Web Archive Austria are discussed.
Keywords
Internet; content management; data analysis; software maintenance; EC project; MapReduce environment; SCAPE; ToMaR; Web Archive Austria; archived Web content characterization; binary content; content-specific applications; data generator; data set extraction; legacy application integration; scalable content processing; scalable data management environment; scalable preservation environment project; workflow; Containers; Context; Data mining; Libraries; Software packages; XML; content management; data-intensive computing; mapreduce; metadata extraction;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location
Chicago, IL
Type
conf
DOI
10.1109/CCGrid.2014.88
Filename
6846550