Regional Information Center for Science and Technology - A Web Text Extraction Method Based on Regular Expressions and Text Density

DocumentCode :

2812833

Title :

A Web Text Extraction Method Based on Regular Expressions and Text Density

Author :

Li, Fayun

Author_Institution :

Public Manage. Sch., Fuzhou Univ., Fuzhou, China

Volume :

fYear :

2011

fDate :

26-27 Nov. 2011

Firstpage :

287

Lastpage :

290

Abstract :

Building on the strengths of existing web text extraction algorithms, this paper proposes a new method that combines regular expressions with page text density. The method first uses regular expressions to remove the HTML tags, exploiting the characteristics of the web page source code, and then extracts the main text of the page according to the distribution density of the text. The algorithm is simple and efficient, and tests show that it achieves higher extraction accuracy.
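
The abstract gives only the two-stage outline, so the following Python sketch is one possible reading of it and not the paper's implementation. It assumes that "text density" means the number of text characters left on each source line after tag removal, and that the main text is the run of dense lines with the most characters. The function names, the regular expressions, and the `min_chars` and `max_gap` parameters are all hypothetical choices made for illustration.

```python
import html
import re

# Illustrative patterns (assumptions, not taken from the paper).
_BLOCKS = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_COMMENTS = re.compile(r"<!--.*?-->", re.S)
_TAGS = re.compile(r"<[^>]+>", re.S)


def strip_markup(source: str) -> list[str]:
    """Stage 1: remove markup with regular expressions, one entry per line."""
    source = _BLOCKS.sub("", source)    # drop script/style blocks with their bodies
    source = _COMMENTS.sub("", source)  # drop HTML comments
    source = _TAGS.sub("", source)      # drop every remaining tag
    source = html.unescape(source)      # decode entities such as &amp;
    # Collapse runs of whitespace so line length reflects visible text only.
    return [" ".join(line.split()) for line in source.splitlines()]


def extract_main_text(source: str, min_chars: int = 30, max_gap: int = 2) -> str:
    """Stage 2: keep the run of dense lines that holds the most text.

    A line counts as dense when it has at least min_chars characters. A run
    may bridge up to max_gap sparse lines; the sparse lines are skipped.
    """
    best: list[str] = []
    current: list[str] = []
    gap = 0
    # Trailing empty lines act as padding that closes the final run.
    for line in strip_markup(source) + [""] * (max_gap + 1):
        if len(line) >= min_chars:
            current.append(line)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:
                if sum(map(len, current)) > sum(map(len, best)):
                    best = current
                current, gap = [], 0
    return "\n".join(best)


SAMPLE = """<html><head><title>T</title>
<style>body { color: red; }</style>
<script>var x = "<p>not text</p>";</script>
</head><body>
<div><a href="/">Home</a></div>
<div><a href="/a">About</a></div>
<!-- comment -->
<p>Regular expressions first strip the <b>markup</b> from the page source.</p>
<p>Text density then locates the main body among the remaining lines.</p>
<div><a href="/c">Contact</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the short navigation lines fall below the density threshold, so only the two paragraph lines are kept. Both thresholds would need tuning on real pages, and a tag-stripping regular expression of this kind can misfire on malformed markup.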

Keywords :

Internet; hypermedia markup languages; information retrieval; text analysis; HTML tags; Web page source code; Web text extraction method; regular expressions; text density; Accuracy; Algorithm design and analysis; Data mining; Feature extraction; HTML; Noise; Web pages; Regular expressions; Text density; Text extraction; Web page;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Management, Innovation Management and Industrial Engineering (ICIII), 2011 International Conference on

Conference_Location :

Shenzhen

Print_ISBN :

978-1-61284-450-3

Type :

conf

DOI :

10.1109/ICIII.2011.73

Filename :

6115483

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2812833