• DocumentCode
    169839
  • Title
    A Hadoop Extension to Process Mail Folders and its Application to a Spam Dataset
  • Author
    Las-Casas, Pedro H. B. ; Santos Dias, Vinicius ; Ferreira, Ricardo ; Meira, Wagner ; Guedes, Dorgival
  • Author_Institution
    Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
  • fYear
    2014
  • fDate
    22-24 Oct. 2014
  • Firstpage
    108
  • Lastpage
    113
  • Abstract
    Even as Web 2.0 grows, e-mail remains one of the most widely used forms of communication on the Internet and generates huge amounts of data. Spam traffic, for example, accounts for terabytes of data daily. Tools are therefore needed that can process these data efficiently and in large volumes in order to understand their characteristics. Although mail servers can receive and store messages as they arrive, applying complex algorithms to a large set of mailboxes, whether for characterization, security, or data mining, is challenging. Big data processing environments such as Hadoop are useful for analyzing large data sets, but they were originally designed to handle text files in general. In this paper we present a Hadoop extension for processing and analyzing large sets of e-mail organized in mailboxes. To evaluate it, we used gigabytes of real spam traffic data collected around the world and showed that our approach processes large amounts of mail data efficiently.
  • Keywords
    Big Data; Internet; data mining; unsolicited e-mail; Hadoop extension; Web 2.0; big data processing; e-mail; mail folder; spam dataset; spam traffic; Educational institutions; Electronic mail; Postal services; Programming; Servers; hadoop; mail; spam
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on
  • Conference_Location
    Paris
  • Type
    conf
  • DOI
    10.1109/SBAC-PADW.2014.25
  • Filename
    6972024