A new algorithm: Extracting text information from Webpage based on block and tag-function

Author

Dingrong Yuan ; Xiaohu Yang ; Xue Nong ; Huiwen Fu

Author_Institution

Coll. of Comput. Sci. & Inf. Technol., Guangxi Normal Univ., Guilin, China

fYear

2012

fDate

7-9 Dec. 2012

Firstpage

Lastpage

Abstract

A Webpage contains a great deal of information that users need; however, it is also filled with plenty of noise. How to remove this noise and extract useful text information has become one of the hottest topics in the field of Web data mining. This paper proposes a text information extraction algorithm based on visual information and tag-function. In this algorithm, a webpage is first divided into different blocks, and text information is then extracted from these blocks based on rules derived from the characteristics of tag-function. Experiments show that the algorithm is effective and efficient.
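The two-step idea in the abstract (split the page into blocks, then filter blocks by rules tied to what their tags do) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the actual rules, and the sketch splits on tags only, without the visual information the paper uses. The tag groups `BLOCK_TAGS`, `LINK_TAGS`, `SKIP_TAGS`, the names `BlockSplitter` and `extract_text`, and the link-density and minimum-length thresholds are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical grouping of tags by function (assumed, not from the paper).
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}  # start a new block
LINK_TAGS = {"a"}                                            # navigation-like text
SKIP_TAGS = {"script", "style"}                              # never visible text


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._parts = []       # text fragments of the block being built
        self._link_chars = 0   # characters of the current block inside links
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in LINK_TAGS:
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in LINK_TAGS:
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_text(html, max_link_density=0.5, min_chars=20):
    """Keeps blocks that are long enough and not dominated by link text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # emit any trailing block
    return [
        text
        for text, link_chars in splitter.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
```

Calling `extract_text(page_source)` returns a list of block texts judged to be content. The thresholds are arbitrary starting points and would need tuning on real pages.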

Keywords

Web sites; data mining; text analysis; Web data mining; Webpage; block-function; tag-function; text information extraction algorithm; DOM tree; information extraction; text information; visual block

fLanguage

English

Publisher

IET

Conference_Titel

Information Science and Control Engineering 2012 (ICISCE 2012), IET International Conference on

Conference_Location

Shenzhen

Electronic_ISBN

978-1-84919-641-3

Type

conf

DOI

10.1049/cp.2012.2437

Filename

6755816

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=684851