DocumentCode
169839
Title
A Hadoop Extension to Process Mail Folders and its Application to a Spam Dataset
Author
Las-Casas, Pedro H. B. ; Santos Dias, Vinicius ; Ferreira, Ricardo ; Meira, Wagner ; Guedes, Dorgival
Author_Institution
Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
fYear
2014
fDate
22-24 Oct. 2014
Firstpage
108
Lastpage
113
Abstract
Even as Web 2.0 grows, e-mail remains one of the most widely used forms of communication on the Internet and generates huge amounts of data. Spam traffic, for example, accounts for terabytes of data daily. Tools are therefore needed that can process these data efficiently and in large volumes in order to understand their characteristics. Although mail servers can receive and store messages as they arrive, applying complex algorithms to a large set of mailboxes, whether for characterization, security, or data mining, is challenging. Big data processing environments such as Hadoop are useful for analyzing large data sets, but they were originally designed to handle text files in general. In this paper we present a Hadoop extension for processing and analyzing large sets of e-mail organized in mailboxes. To evaluate it, we used gigabytes of real spam traffic data collected around the world and showed that our approach processes large amounts of mail data efficiently.
Keywords
Big Data; Internet; data mining; unsolicited e-mail; Hadoop extension; Web 2.0; big data processing; e-mail; mail folder; spam dataset; spam traffic; Educational institutions; Electronic mail; Postal services; Programming; Servers; hadoop; mail; spam
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on
Conference_Location
Paris
Type
conf
DOI
10.1109/SBAC-PADW.2014.25
Filename
6972024