Title :
E-mail address categorization based on semantics of surnames
Author :
Veluru, Suresh ; Rahulamathavan, Yogachandran ; Viswanath, Pramod ; Longley, Paul ; Rajarajan, Muttukrishnan
Author_Institution :
Inf. Security Group, City Univ. London, London, UK
Abstract :
Surname (family name) analysis is used in geography to understand population origins, migration, identity, social norms and cultural customs, some of which are thought to have evolved over generations. Surnames exhibit good statistical properties that can be used to extract information from a names data set, such as the automatic detection of ethnic or community groups. An e-mail address often contains a surname as a substring, and this containment may be full or partial. The objective of this paper is to categorize e-mail addresses based on the semantics of surnames. This is achieved in two phases. The first phase deals with surname representation and clustering: a vector space model is proposed on which latent semantic analysis is performed, and clustering is done using the average-linkage method. In the second phase, an e-mail address is assigned to one of the categories discovered in the first phase. This requires substring matching, which is done efficiently using a suffix tree data structure. We perform an experimental evaluation on the 500 most frequently occurring surnames in India and the United Kingdom, and we categorize the e-mail addresses that contain these surnames as substrings.
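The second phase described in the abstract (matching surnames as substrings of an e-mail address) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an uncompressed suffix trie built from nested dictionaries as a simplified stand-in for a suffix tree, handles only full containment, and resolves several matches by taking the longest surname. The function names, the `SURNAME_TO_CATEGORY` mapping, the cluster labels and the sample addresses are all hypothetical.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie (nested dicts) over `text`.

    Assumption: a simplified stand-in for a suffix tree. It uses O(n^2)
    space, which is acceptable for short strings such as e-mail local parts.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def categorize_email(address, surname_to_category):
    """Return the category of the longest surname fully contained in the
    local part of `address`, or None if no surname matches.

    Assumption: the longest-match rule is illustrative only; partial
    containment is not handled here.
    """
    local_part = address.split("@", 1)[0].lower()
    trie = build_suffix_trie(local_part)
    matches = [s for s in surname_to_category if contains(trie, s.lower())]
    if not matches:
        return None
    return surname_to_category[max(matches, key=len)]


# Hypothetical output of the clustering phase: surname -> cluster label.
SURNAME_TO_CATEGORY = {
    "patel": "cluster_A",
    "reddy": "cluster_A",
    "smith": "cluster_B",
}

if __name__ == "__main__":
    for addr in ("john.smith42@example.com", "k_reddy@example.org", "info@example.net"):
        print(addr, categorize_email(addr, SURNAME_TO_CATEGORY))
```

Indexing the address once and then querying each surname keeps every lookup proportional to the surname length, which is the property that motivates a suffix-based index in this setting.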
Keywords :
data analysis; electronic mail; pattern clustering; semantic networks; string matching; tree data structures; vectors; automatic detection; average-linkage method; community groups; cultural customs; e-mail address categorization; ethnic groups; family name analysis; geography; latent semantic analysis; names data set; population identity; population migration; population origins; social norms; statistical properties; substring matching; suffix tree data structure; surname clustering; surname representation; vector space model; Clustering algorithms; Clustering methods; Data mining; Electronic mail; Matrix decomposition; Semantics; Vectors; Vector space model; average link clustering method; latent semantic analysis; suffix tree; surnames;
Conference_Title :
Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on
Conference_Location :
Singapore
DOI :
10.1109/CIDM.2013.6597240