Content Extraction from Chinese Web Pages Based on Punctuations Distribution

Author

Peng, Qian ; Wang, Qinglin ; Li, Yuan ; Zhang, Jixian ; Hao, Yuexing

Author_Institution

Sch. of Autom., Beijing Inst. of Technol., Beijing, China

fYear

2012

fDate

11-13 Aug. 2012

Firstpage

1351

Lastpage

1355

Abstract

Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and general approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
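
The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the distribution or the boundaries are computed, so the sliding-window density measure, the gap threshold, the punctuation set, and all function names and parameter values (`densest_index`, `expand_boundaries`, `extract_content`, `window`, `max_gap`) are assumptions made for illustration only.

```python
import re

# Assumed set of full-width Chinese punctuation marks (not taken from the paper).
CN_PUNCT = set(",。!?;:、")


def densest_index(positions, window):
    """Return the index (into positions) at the middle of the densest window.

    Hypothetical stand-in for "finding a position inside the content":
    the largest group of punctuation marks spanning at most `window` characters.
    """
    best_count, best_mid, start = 0, 0, 0
    for end in range(len(positions)):
        while positions[end] - positions[start] > window:
            start += 1
        count = end - start + 1
        if count > best_count:
            best_count, best_mid = count, (start + end) // 2
    return best_mid


def expand_boundaries(positions, seed, max_gap):
    """Grow left and right from the seed while neighbouring marks are close."""
    left = right = seed
    while left > 0 and positions[left] - positions[left - 1] <= max_gap:
        left -= 1
    while right < len(positions) - 1 and positions[right + 1] - positions[right] <= max_gap:
        right += 1
    return left, right


def extract_content(html, window=200, max_gap=100):
    """Extract text between the assumed left and right boundaries."""
    positions = [i for i, ch in enumerate(html) if ch in CN_PUNCT]
    if not positions:
        return ""
    seed = densest_index(positions, window)
    left, right = expand_boundaries(positions, seed, max_gap)
    # Start just after the nearest preceding tag so the first sentence is whole.
    start = html.rfind(">", 0, positions[left]) + 1
    end = positions[right] + 1
    # Remove any tags left inside the extracted fragment.
    return re.sub(r"<[^>]+>", "", html[start:end])
```

Calling `extract_content(page_source)` on a page whose navigation and advertising blocks contain little or no Chinese punctuation would return only the punctuation-dense run of text. The default thresholds are arbitrary and would need tuning on real pages.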

Keywords

Internet; hypermedia markup languages; information resources; information retrieval; natural language processing; Chinese punctuation distribution; Chinese web pages; HTML page; content extraction; left boundary computation; right boundary computation; Accuracy; Data mining; Feature extraction; HTML; Kernel; Navigation; Web pages; kernel punctuation; punctuation distribution

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science & Service System (CSSS), 2012 International Conference on

Conference_Location

Nanjing

Print_ISBN

978-1-4673-0721-5

Type

conf

DOI

10.1109/CSSS.2012.341

Filename

6394579