Title :
On Web Page extraction based on position of DIV
Author :
Liu, Xunhua ; Li, Hui ; Wu, Dan ; Huang, Jiaqing ; Wang, Wei ; Yu, Li ; Wu, Ye ; Xie, Hengjun
Author_Institution :
Key Lab. of Integrated Microsyst. Sci. & Eng. Applic., Shenzhen Grad. Sch. of Peking Univ., Shenzhen, China
Abstract :
For the DIV-based page layout popular in Web pages, this paper presents a method that uses the position of DIVs to extract the main text from the body of a Web page through reconstruction, retention of atomic DIVs, and analysis of DIV positions. Experiments showed that extraction accuracy can exceed 90% and that the method is highly versatile.
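The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea, not the authors' method. It assumes that an "atomic" DIV is one containing no nested DIV, uses document order as a stand-in for DIV position, and picks the atomic DIV with the most text as the main text. The names `AtomicDivCollector` and `extract_main_text` are hypothetical, and script or style content is not filtered out.

```python
from html.parser import HTMLParser


class AtomicDivCollector(HTMLParser):
    """Collect DIVs that contain no nested DIV (assumed meaning of 'atomic')."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one [has_child_div, text_parts, index] per open DIV
        self.atomic = []  # (document_order_index, text) for each atomic DIV
        self.seen = 0     # number of DIVs opened so far

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.stack:
                self.stack[-1][0] = True  # the enclosing DIV is not atomic
            self.stack.append([False, [], self.seen])
            self.seen += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            has_child, parts, index = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if not has_child and text:
                self.atomic.append((index, text))

    def handle_data(self, data):
        # Text is credited only to the innermost open DIV.
        if self.stack:
            self.stack[-1][1].append(data)


def extract_main_text(html):
    """Return the text of the longest atomic DIV (an assumed heuristic)."""
    collector = AtomicDivCollector()
    collector.feed(html)
    collector.close()
    if not collector.atomic:
        return ""
    _, text = max(collector.atomic, key=lambda item: len(item[1]))
    return text


if __name__ == "__main__":
    page = (
        '<div id="nav"><div>Home</div><div>About</div></div>'
        '<div id="main"><div>This is the long main article text.</div></div>'
        "<div>Footer</div>"
    )
    print(extract_main_text(page))
```

A position-based method would presumably also weigh where each DIV sits on the page; the recorded document-order index is kept in `collector.atomic` so that such a rule could replace the text-length heuristic.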
Keywords :
Internet; information filtering; text analysis; DIV page layout; Web page extraction; text extraction; Cascading style sheets; Containers; Data mining; HTML; Information analysis; Information retrieval; Microelectronics; Standards publication; Telecommunications; Web pages; DIV position analysis; main text of web page; web information extraction;
Conference_Title :
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-5585-0
Electronic_ISBN :
978-1-4244-5586-7
DOI :
10.1109/ICCAE.2010.5451751