A text block context informations based multiple Web contents extraction

Author

Wonmoon Song;Myungwon Kim

Author_Institution

Strategic Business Team, ONYCOM, Seoul, Republic of Korea

fYear

2015

Firstpage

1

Lastpage

8

Abstract

In the Web environment, providing Web services that match users' needs requires quickly and accurately extracting contents such as main-content, menu-list, article-list, and comments from Web documents. In this paper, we propose an efficient method that extracts various contents from Web documents. The method separates text blocks from the document, then extracts context information and uses it to classify the content type of each text block. Context information consists of documenting patterns and structural features of a Web document. For documenting patterns, we use in/out link information, which extends the word/link density proposed in previous work. For structural features, we use the distances between text blocks and the parent tags of the target text block. We evaluated our method on a published data set and on a data set that we collected. The experimental results show that, compared to existing methods, our method achieves about 17 percentage points higher accuracy for multiple-content extraction and about 14 percentage points higher F-measure for main-content extraction.
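
The kind of per-block context features the abstract describes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `BlockFeatureParser`, the function `extract_block_features`, the `BLOCK_TAGS` and `VOID_TAGS` sets, and the reading of "in/out link" as same-host versus other-host links are all assumptions made for illustration. The block index is recorded so that a distance between two text blocks could be taken as the difference of their indices, which is likewise an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical choice of tags that start or end a text block.
BLOCK_TAGS = {"body", "div", "p", "ul", "ol", "li", "table", "tr", "td",
              "h1", "h2", "h3"}
# Tags without a closing tag; they are never pushed on the tag stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockFeatureParser(HTMLParser):
    """Splits a page into text blocks and counts words per block."""

    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host
        self.stack = []        # currently open tags
        self.link_kind = None  # None, "in" (same host) or "out" (other host)
        self.blocks = []
        self._new_block()

    def _new_block(self):
        self.cur = {"words": 0, "in_link_words": 0,
                    "out_link_words": 0, "parents": ()}

    def flush(self):
        # Keep the current block only if it contains text.
        if self.cur["words"] > 0:
            self.blocks.append(self.cur)
        self._new_block()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            # Assumption: relative or same-host links count as "in" links.
            self.link_kind = "in" if host in ("", self.site_host) else "out"
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a":
            self.link_kind = None
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        n = len(data.split())
        if n == 0:
            return
        if self.cur["words"] == 0:
            # Parent tags of the block, ignoring inline anchors.
            self.cur["parents"] = tuple(t for t in self.stack if t != "a")
        self.cur["words"] += n
        if self.link_kind == "in":
            self.cur["in_link_words"] += n
        elif self.link_kind == "out":
            self.cur["out_link_words"] += n


def extract_block_features(html, site_host):
    """Returns one feature dict per non-empty text block, in document order."""
    parser = BlockFeatureParser(site_host)
    parser.feed(html)
    parser.close()
    parser.flush()
    return [
        {
            "index": i,
            "words": b["words"],
            "in_link_density": b["in_link_words"] / b["words"],
            "out_link_density": b["out_link_words"] / b["words"],
            "parents": b["parents"],
        }
        for i, b in enumerate(parser.blocks)
    ]
```

Calling `extract_block_features(html, "example.com")` on a page string yields a list of dictionaries that could be fed to any classifier. A block whose words are almost all inside same-host links would look like a menu-list candidate, while a long block with low link density would look like main-content; the classifier and thresholds themselves are not specified in the abstract and are left out here.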

Keywords

"Feature extraction","HTML","Context","Visualization","Data mining","Standards","XML"

Publisher

IEEE

Conference_Titel

Data Science and Advanced Analytics (DSAA), 2015. 36678 2015. IEEE International Conference on

Print_ISBN

978-1-4673-8272-4

Type

conf

DOI

10.1109/DSAA.2015.7344829

Filename

7344829