Title :
Graph-Based AJAX Crawl: Mining Data from Rich Internet Applications
Author :
Peng, Zhaomeng ; He, Nengqiang ; Jiang, Chunxiao ; Li, Zhihua ; Xu, Lei ; Li, Yipeng ; Ren, Yong
Author_Institution :
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
Abstract :
AJAX (Asynchronous JavaScript and XML) is becoming increasingly popular with the rise of Web 2.0. However, traditional crawlers fail to retrieve information from AJAX applications because of their complex JavaScript operations. Moreover, a single AJAX application with one URL may have several different page states, which violates the rule that one URL corresponds to one unique page. An AJAX application can be modeled as a state transition graph, so crawling it amounts to traversing the graph without prior knowledge of its structure. In this paper, we distinguish different AJAX events, which are not well defined in previous work, and propose a Graph-based AJAX State Traversal (GAST) algorithm that crawls AJAX applications with minimal edge visits. If the topology of the graph is given, this optimization problem becomes a Directed Rural Postman Problem (DRPP), and the optimal lower bound can be obtained. Experimental results show that the proposed algorithm approaches the optimum and outperforms existing work.
Keywords :
Internet; Java; XML; data mining; graph theory; topology; URL; Web 2.0; asynchronous JavaScript and XML; directed rural postman problem; graph topology; graph-based AJAX crawl; graph-based AJAX state traversal algorithm; minimal edge visits; rich Internet applications; Algorithm design and analysis; Browsers; Crawlers; HTML; Heuristic algorithms; Robots; Topology; AJAX Crawl; Directed Rural Postman Problem; State Transition Graph; State Traversal
Conference_Title :
Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on
Conference_Location :
Hangzhou
Print_ISBN :
978-1-4673-0689-8
DOI :
10.1109/ICCSEE.2012.38