DocumentCode :
2839935
Title :
A template-based method for theme information extraction from web pages
Author :
Yin, Gui-Sheng ; Guo, Guang-Dong ; Sun, Jing-Jing
Author_Institution :
Dept. of Comput. Sci. & Technol., Harbin Eng. Univ., Harbin, China
Volume :
3
fYear :
2010
fDate :
22-24 Oct. 2010
Abstract :
Introducing web page templates and DOM technology makes it possible to effectively extract simple structured information from web pages. Building on previous research, this paper presents a new method for inducing web page templates. The method can accommodate the various layout elements of web page templates. The main research contents include an edit-distance-based method for judging DOM document similarity, a clustering method focused on web page structure, a method for extracting web page templates, and the implementation of an information extraction engine.
Keywords :
Web sites; distributed object management; document handling; information retrieval; DOM document similarity judgment; DOM technology; Web information; Web pages; Web structure; clustering method; edit distance; inductive Web page templates; information extraction engine; simple structured information; theme information extraction; Electronic mail; Noise reduction; Page similarity; Template Method; Web Clustering; Web Extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Application and System Modeling (ICCASM), 2010 International Conference on
Conference_Location :
Taiyuan
Print_ISBN :
978-1-4244-7235-2
Electronic_ISBN :
978-1-4244-7237-6
Type :
conf
DOI :
10.1109/ICCASM.2010.5620763
Filename :
5620763