Title :
Advanced Deep Web Crawler Based on Dom
Author :
Ma, Weicheng ; Chen, Xiuxia ; Shang, Wenqian
Author_Institution :
Sch. of Comput., Commun. Univ. of China, Beijing, China
Abstract :
A large amount of data today is available only in the deep web. A review of existing work on deep web crawlers makes it evident that no perfect, or even complete, crawler for deep web data has yet been built. To meet the needs of deep web search, we have designed a new crawler structure, currently focused mainly on extracting data from forms, the most common type of deep web interface. Our crawler introduces several innovations, such as a mainframe extracting module, an algorithm that uses improved Bayesian classification to distinguish different websites sharing the same URL, and extended support for handling AJAX forms. In addition, a DOM tree is used to make the analysis and processing of downloaded web pages easier and more visual.
Keywords :
Bayes methods; Internet; Web sites; document handling; information retrieval; pattern classification; trees (mathematics); AJAX form; Bayesian classification; Dom Tree; Web pages; Website URL; advanced deep Web crawler; crawler structure; deep Web data; deep Web interface; deep Web search; form data extraction; mainframe extracting module; Bayesian methods; Crawlers; Data mining; Feature extraction; HTML; XML; AJAX; Deep Web; Form
Conference_Title :
Computational Sciences and Optimization (CSO), 2012 Fifth International Joint Conference on
Conference_Location :
Harbin
Print_ISBN :
978-1-4673-1365-0
DOI :
10.1109/CSO.2012.138