DocumentCode :
2163645
Title :
On Web Page extraction based on position of DIV
Author :
Liu, Xunhua ; Li, Hui ; Wu, Dan ; Huang, Jiaqing ; Wang, Wei ; Yu, Li ; Wu, Ye ; Xie, Hengjun
Author_Institution :
Key Lab. of Integrated Microsyst. Sci. & Eng. Applic., Shenzhen Grad. Sch. of Peking Univ., Shenzhen, China
Volume :
4
fYear :
2010
fDate :
26-28 Feb. 2010
Firstpage :
144
Lastpage :
147
Abstract :
For the DIV-based page layout popular in Web pages, this paper presents a method that uses the position of DIV elements to extract the main text from the body of a Web page through reconstruction, retention of atomic DIVs, and analysis of DIV position. Experiments showed that the extraction accuracy can exceed 90% and that the method is highly versatile.
Keywords :
Internet; information filtering; text analysis; DIV page layout; Web page extraction; text extraction; Cascading style sheets; Containers; Data mining; HTML; Information analysis; Information retrieval; Microelectronics; Standards publication; Telecommunications; Web pages; DIV position analysis; main text of web page; web information extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-5585-0
Electronic_ISBN :
978-1-4244-5586-7
Type :
conf
DOI :
10.1109/ICCAE.2010.5451751
Filename :
5451751