DocumentCode
684851
Title
A new algorithm: Extracting text information from Webpage based on block and tag-function
Author
Dingrong Yuan ; Xiaohu Yang ; Xue Nong ; Huiwen Fu
Author_Institution
Coll. of Comput. Sci. & Inf. Technol., Guangxi Normal Univ., Guilin, China
fYear
2012
fDate
7-9 Dec. 2012
Firstpage
1
Lastpage
4
Abstract
A Webpage contains a great deal of information that users need, but it is also filled with noise. How to remove this noise and extract useful text information has become one of the hottest topics in the field of Web data mining. This paper proposes a text information extraction algorithm based on visual information and tag-function. In this algorithm, a webpage is first divided into different blocks, and text information is then extracted from these blocks using rules derived from the characteristics of tag-function. Experiments show that the algorithm is effective and efficient.
Keywords
Web sites; data mining; text analysis; Web data mining; Webpage; block-function; tag-function; text information extraction algorithm; DOM tree; information extraction; text information; visual block
fLanguage
English
Publisher
iet
Conference_Titel
Information Science and Control Engineering 2012 (ICISCE 2012), IET International Conference on
Conference_Location
Shenzhen
Electronic_ISBN
978-1-84919-641-3
Type
conf
DOI
10.1049/cp.2012.2437
Filename
6755816