Title :
Analysis and detection of Soft-404 pages
Author :
Prieto, Victor M. ; Alvarez, M. ; Cacheda, Fidel
Author_Institution :
Dept. of Inf. & Commun. Technol., Univ. of A Coruña, A Coruña, Spain
Abstract :
The WWW is continuously growing, but not always in the best way, owing to the proliferation of garbage content such as Web spam pages, duplicate content and dead links. Some web servers do not use the appropriate HTTP response code for dead links, so these links are incorrectly identified, which causes problems for search engines. Our analysis reveals that 7.35% of web servers return a 200 HTTP code, instead of a 404 code (which indicates that the document was not found), when they receive a request for an unknown document. The pages returned in this way are known as Soft-404 pages. They are a problem for search engines because their crawling modules process and index these pages, wasting resources. Few studies analyse this problem or try to solve it. In this article we propose a new detection system for Soft-404 pages, called Soft404Detector, which combines a set of content-based heuristics with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that the system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
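The server behaviour described in the abstract (answering a request for an unknown document with 200 instead of 404) can be illustrated with a small probe. This is only a minimal sketch under stated assumptions, not the Soft404Detector system, which according to the abstract relies on content-based heuristics and a C4.5 classifier. The function names, the random-path strategy and the ".html" suffix are hypothetical choices made for illustration.

```python
import random
import string
import urllib.error
import urllib.request
from urllib.parse import urljoin


def random_probe_url(base_url, length=20, rng=random):
    """Build a URL on the same host that almost certainly does not exist.

    Hypothetical helper: a random lowercase file name placed at the site root.
    """
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return urljoin(base_url, "/" + name + ".html")


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url (urllib follows redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the error.
        return err.code


def looks_like_soft_404_server(base_url, fetch=fetch_status):
    """True if a request for an unknown document is answered with 200.

    Assumption: one random probe is enough for illustration. The fetch
    argument is injectable so the logic can be exercised without a network.
    """
    return fetch(random_probe_url(base_url)) == 200
```

For example, `looks_like_soft_404_server("http://example.com/")` would issue one request to a random path on that host. A probe like this only characterises the server; deciding whether an individual page is a Soft-404 page is the harder problem that the content-based classifier addresses.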
Keywords :
Web sites; classification; content management; indexing; search engines; 404 code; C4.5 classifier; HTTP response code; Soft404Detector; Web servers; Web spam pages; World Wide Web; content-based heuristics; crawling modules; dead links; detection system; duplicate content; garbage contents; page indexing; page processing; search engines; soft-404 pages analysis; soft-404 pages detection; Algorithm design and analysis; Crawlers; Search engines; Training; Unsolicited electronic mail; Web pages; Web servers;
Conference_Title :
Innovative Computing Technology (INTECH), 2013 Third International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4799-0047-3
DOI :
10.1109/INTECH.2013.6653695