Title :
A framework for multi-features based Web harmful information identification
Author :
Tian, Xiao-Ping ; Geng, Guang-Gang ; Li, Hong-Tao
Author_Institution :
Center of Inf. & Network Technol., Beijing Normal Univ., Beijing, China
Abstract :
In recent years, the spread of harmful information such as pornography, phishing, and violence has seriously disturbed the order of the Web, caused a series of adverse effects, and in particular affected young people's physical and mental health. Statistical learning based harmful information detection methods, the current research focus, have shown their superiority in adapting easily to newly developed harmful techniques. Feature selection is one of the key factors that influence the development of Web harmful information detection systems. This paper describes a novel framework for recognizing harmful Web pages. In this framework, multi-modal features are extracted, and each modal feature reflects a different aspect of the spam information. Based on these features, we give a feature fusion strategy. Considering the distribution of normal and harmful websites, we investigate the use of an ensemble under-sampling classification strategy to exploit the inherent imbalance of labels in this classification problem.
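The abstract names two techniques, feature fusion and ensemble under-sampling, without giving details. The following is only a minimal sketch of the general idea, not the authors' method. It assumes that fusion is plain concatenation of per-modality feature vectors, that harmful pages (label 1) are the minority class, and that a toy nearest-centroid learner stands in for the unspecified base classifier. All function names are hypothetical.

```python
import random

def fuse(*feature_groups):
    """Early fusion (assumed): concatenate per-modality feature vectors."""
    return [x for group in feature_groups for x in group]

def train_centroid(samples, labels):
    """Toy base learner (assumed): one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

def train_undersampled_ensemble(samples, labels, n_models=5, seed=0):
    """Train n_models learners, each on all minority samples plus an
    equally sized random subset of the majority class.
    Assumes label 1 is the minority and label 0 the majority."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    models = []
    for _ in range(n_models):
        subset = minority + rng.sample(majority, len(minority))
        models.append(train_centroid([samples[i] for i in subset],
                                     [labels[i] for i in subset]))
    return models

def predict_ensemble(models, x):
    """Majority vote over the ensemble; returns 1 (harmful) or 0 (normal)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0
```

Each ensemble member sees a balanced training set, so no single model is dominated by the normal class, while the vote over several random subsets makes use of more of the majority data than a single under-sampled model would.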
Keywords :
Internet; Web sites; classification; computer crime; feature extraction; statistical analysis; Web harmful information identification; World Wide Web; feature fusion strategy; harmful Web pages; harmful Web sites; harmful information detection methods; mental health; multimodal feature extraction; normal Web sites; phishing; physical health; pornography; spam information; statistical learning; under-sampling classification strategy; violence; Data mining; Feature extraction; Internet; Modeling; Training; Unsolicited electronic mail; Web pages;
Conference_Title :
Computer Application and System Modeling (ICCASM), 2010 International Conference on
Conference_Location :
Taiyuan
Print_ISBN :
978-1-4244-7235-2
Electronic_ISBN :
978-1-4244-7237-6
DOI :
10.1109/ICCASM.2010.5623130