DocumentCode :
2812833
Title :
A Web Text Extraction Method Based on Regular Expressions and Text Density
Author :
Li, Fayun
Author_Institution :
Public Manage. Sch., Fuzhou Univ., Fuzhou, China
Volume :
1
fYear :
2011
fDate :
26-27 Nov. 2011
Firstpage :
287
Lastpage :
290
Abstract :
Building on the strengths of several existing web text extraction algorithms, this paper proposes a new method that combines regular expressions with page text density. The method first uses regular expressions, chosen according to the characteristics of web page source code, to remove the HTML tags, and then extracts the main text of the page from the distribution density of the text. The algorithm is simple and efficient, and tests show that the method achieves higher extraction accuracy.
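The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract gives neither the regular expressions nor the density measure, so the patterns, the function names `strip_tags` and `extract_main_text`, and the `min_len` threshold are all assumptions made for illustration. Density is approximated here by line length: the main text is taken to be the contiguous run of long lines holding the most characters.

```python
import re
from html import unescape

# Illustrative patterns (assumptions, not the paper's own expressions).
# Script/style blocks and comments are removed together with their contents.
NON_CONTENT = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>|<!--.*?-->")
ANY_TAG = re.compile(r"(?s)<[^>]+>")


def strip_tags(page: str) -> str:
    """Stage 1: remove non-content blocks, then every remaining tag."""
    without_blocks = NON_CONTENT.sub("", page)
    return unescape(ANY_TAG.sub("", without_blocks))


def extract_main_text(page: str, min_len: int = 30) -> str:
    """Stage 2: keep the densest contiguous run of long text lines.

    min_len is a hypothetical threshold; a line shorter than it is treated
    as navigation or other noise and ends the current run.
    """
    lines = [ln.strip() for ln in strip_tags(page).splitlines() if ln.strip()]
    best, current = [], []
    for line in lines + [""]:  # the empty sentinel closes the final run
        if len(line) >= min_len:
            current.append(line)
        else:
            # Compare runs by total character count, not by line count.
            if sum(map(len, current)) > sum(map(len, best)):
                best = current
            current = []
    return "\n".join(best)
```

A short heading inside the article would split the run in this sketch. A windowed density measure, or a tolerance for a few short lines, would be the natural refinement.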
Keywords :
Internet; hypermedia markup languages; information retrieval; text analysis; HTML tags; Web page source code; Web text extraction method; regular expressions; text density; Accuracy; Algorithm design and analysis; Data mining; Feature extraction; HTML; Noise; Web pages; Regular expressions; Text density; Text extraction; Web page;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Management, Innovation Management and Industrial Engineering (ICIII), 2011 International Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-1-61284-450-3
Type :
conf
DOI :
10.1109/ICIII.2011.73
Filename :
6115483