ToMaR -- A Data Generator for Large Volumes of Content

Author

Schmidt, R. ; Rella, Matthias ; Schlarb, Sven

Author_Institution

AIT Austrian Inst. of Technol., Vienna, Austria

fYear

2014

fDate

26-29 May 2014

Firstpage

937

Lastpage

942

Abstract

We present ToMaR, a scalable application that supports the efficient integration of legacy applications within a MapReduce environment. The work is motivated by scenarios for scalable content processing developed in the context of the EC project SCAPE. ToMaR specifically addresses the need for extracting data sets from large volumes of binary content based on existing, content-specific applications within a scalable data management environment. This paper discusses the main functionalities of ToMaR and describes how ToMaR is utilized as part of a typical workflow. We present a real-world scenario that makes use of ToMaR for the characterization of archived web content. A workflow and experimental results which have been produced using sample content from the Web Archive Austria are discussed.

Keywords

Internet; content management; data analysis; software maintenance; EC project; MapReduce environment; SCAPE; ToMaR; Web Archive Austria; archived Web content characterization; binary content; content-specific applications; data generator; data set extraction; legacy application integration; scalable content processing; scalable data management environment; scalable preservation environment project; workflow; Containers; Context; Data mining; Libraries; Software packages; XML; data-intensive computing; mapreduce; metadata extraction

fLanguage

English

Publisher

IEEE

Conference_Titel

Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on

Conference_Location

Chicago, IL

Type

conf

DOI

10.1109/CCGrid.2014.88

Filename

6846550