• DocumentCode
    168804
  • Title

    ToMaR -- A Data Generator for Large Volumes of Content

  • Author

    Schmidt, R. ; Rella, Matthias ; Schlarb, Sven

  • Author_Institution
    AIT Austrian Inst. of Technol., Vienna, Austria
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    937
  • Lastpage
    942
  • Abstract
    We present ToMaR, a scalable application that supports the efficient integration of legacy applications within a MapReduce environment. The work is motivated by scenarios for scalable content processing developed in the context of the EC project SCAPE. ToMaR specifically addresses the need for extracting data sets from large volumes of binary content based on existing, content-specific applications within a scalable data management environment. This paper discusses the main functionalities of ToMaR and describes how ToMaR is utilized as part of a typical workflow. We present a real-world scenario that makes use of ToMaR for the characterization of archived web content. We discuss a workflow and experimental results produced using sample content from the Web Archive Austria.
  • Keywords
    Internet; content management; data analysis; software maintenance; EC project; MapReduce environment; SCAPE; ToMaR; Web Archive Austria; archived Web content characterization; binary content; content-specific applications; data generator; data set extraction; legacy application integration; scalable content processing; scalable data management environment; scalable preservation environment project; workflow; Containers; Context; Data mining; Libraries; Software packages; XML; content management; data-intensive computing; mapreduce; metadata extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
  • Conference_Location
    Chicago, IL
  • Type

    conf

  • DOI
    10.1109/CCGrid.2014.88
  • Filename
    6846550