DocumentCode
584447
Title
Content Extraction from Chinese Web Pages Based on Punctuations Distribution
Author
Peng, Qian ; Wang, Qinglin ; Li, Yuan ; Zhang, Jixian ; Hao, Yuexing
Author_Institution
Sch. of Autom., Beijing Inst. of Technol., Beijing, China
fYear
2012
fDate
11-13 Aug. 2012
Firstpage
1351
Lastpage
1355
Abstract
Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and universal approach for extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is located. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content within the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
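The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the punctuation set, the unit over which the distribution is computed, or the boundary rule, so the line-based counting, the CN_PUNCT set, and the window and max_gap parameters below are all assumptions chosen for illustration.

```python
import re
from html import unescape

# Assumed punctuation set; the paper's own list is not stated in the abstract.
CN_PUNCT = "，。；：？！、“”"
TAG_RE = re.compile(r"<[^>]+>")


def punct_counts(lines):
    """Count Chinese punctuation marks on each line of the HTML source."""
    return [sum(ch in CN_PUNCT for ch in line) for line in lines]


def find_seed(counts, window=2):
    """Step 1 (assumed rule): pick the punctuated line whose neighbourhood
    of +/- `window` lines holds the most punctuation. Returns None if no
    line contains any Chinese punctuation."""
    best, best_score = None, -1
    for i, c in enumerate(counts):
        if c == 0:
            continue
        score = sum(counts[max(0, i - window): i + window + 1])
        if score > best_score:
            best, best_score = i, score
    return best


def expand(counts, seed, max_gap=2):
    """Step 2 (assumed rule): grow the left and right boundaries from the
    seed, stopping once more than `max_gap` consecutive lines carry no
    punctuation."""
    left = right = seed
    gap, i = 0, seed - 1
    while i >= 0 and gap <= max_gap:
        if counts[i] > 0:
            left, gap = i, 0
        else:
            gap += 1
        i -= 1
    gap, i = 0, seed + 1
    while i < len(counts) and gap <= max_gap:
        if counts[i] > 0:
            right, gap = i, 0
        else:
            gap += 1
        i += 1
    return left, right


def extract_content(html):
    """Step 3: strip tags from the lines between the two boundaries."""
    lines = html.splitlines()
    counts = punct_counts(lines)
    seed = find_seed(counts)
    if seed is None:
        return ""
    left, right = expand(counts, seed)
    texts = (unescape(TAG_RE.sub("", line)).strip() for line in lines[left:right + 1])
    return "\n".join(t for t in texts if t)
```

Calling extract_content(html_source) on a page returns the text between the estimated boundaries, or an empty string when the source contains no Chinese punctuation. The intuition the sketch relies on is that navigation bars and footers rarely contain sentence punctuation, while body text does; how well the thresholds transfer to real pages is not something this sketch establishes.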
Keywords
Internet; hypermedia markup languages; information resources; information retrieval; natural language processing; Chinese punctuation distribution; Chinese web pages; HTML page; Internet; content extraction; information resources; left boundary computation; right boundary computation; Accuracy; Data mining; Feature extraction; HTML; Kernel; Navigation; Web pages; content extraction; kernel punctuation; punctuation distribution;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science & Service System (CSSS), 2012 International Conference on
Conference_Location
Nanjing
Print_ISBN
978-1-4673-0721-5
Type
conf
DOI
10.1109/CSSS.2012.341
Filename
6394579