DocumentCode :
2155636
Title :
Research of Self-Adaptive Web Page Parser Based on Templates and Rules
Author :
Hu, Jinzhu ; Zhou, Xing ; Shu, Jiangbo ; Xiong, Chunxiu
Author_Institution :
Dept. of Comput. Sci., HuaZhong Normal Univ., Wuhan, China
fYear :
2009
fDate :
20-22 Sept. 2009
Firstpage :
1
Lastpage :
4
Abstract :
Web page parsing has attracted considerable attention in recent years. How to eliminate human intervention and formulate extraction rules for subject information from a large number of Web pages quickly and accurately has become an important research question in this field. This paper proposes a framework for a self-adaptive Web page parser based on templates and rules. It first uses a noise filter algorithm to remove irrelevant and invalid nodes, then combines page templates with heuristic rules to generate extraction rules. An automatic detection mechanism also lets it adjust the extraction rules dynamically in response to external factors. Parsers generated with this framework are more self-adaptive, generate better extraction rules, and are better at locating and extracting subject information. The experimental results show the effectiveness of the parser.
Keywords :
Web sites; grammars; extraction rules; heuristic rule; noise filter algorithm; page template; self-adaptability; self-adaptive Web page parser; Computer science; Data mining; Databases; Filters; HTML; Humans; Information technology; Noise generators; Search engines; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Management and Service Science, 2009. MASS '09. International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-4638-4
Electronic_ISBN :
978-1-4244-4639-1
Type :
conf
DOI :
10.1109/ICMSS.2009.5304105
Filename :
5304105