DocumentCode :
584447
Title :
Content Extraction from Chinese Web Pages Based on Punctuations Distribution
Author :
Peng, Qian ; Wang, Qinglin ; Li, Yuan ; Zhang, Jixian ; Hao, Yuexing
Author_Institution :
Sch. of Autom., Beijing Inst. of Technol., Beijing, China
fYear :
2012
fDate :
11-13 Aug. 2012
Firstpage :
1351
Lastpage :
1355
Abstract :
Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and general approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation. First, by computing the distribution of Chinese punctuation marks in the HTML source, a position inside the web page content is found. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
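The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how the seed position or the boundaries are computed, so the punctuation set, the median-based seed, the `max_gap` threshold, and the function name `extract_content` are all assumptions made here for illustration.

```python
import re

# Assumption: an illustrative set of full-width Chinese punctuation marks,
# not the list used in the paper.
CHINESE_PUNCT = set(",。、;:?!“”‘’()《》")


def extract_content(html, max_gap=200):
    """Return the text of the region where Chinese punctuation clusters.

    max_gap is a hypothetical threshold: the largest distance (in characters)
    allowed between neighbouring punctuation marks of the same content region.
    """
    # Distribution of punctuation: offsets of every mark in the HTML source.
    pos = [i for i, ch in enumerate(html) if ch in CHINESE_PUNCT]
    if not pos:
        return ""

    # Step 1 (assumed heuristic): take the median mark as a position that
    # lies inside the main content, since most marks occur there.
    seed = len(pos) // 2

    # Step 2: grow the left and right boundaries while marks stay close.
    lo = seed
    while lo > 0 and pos[lo] - pos[lo - 1] <= max_gap:
        lo -= 1
    hi = seed
    while hi < len(pos) - 1 and pos[hi + 1] - pos[hi] <= max_gap:
        hi += 1

    # Widen to the enclosing tag edges so the first and last sentences
    # are not cut off at their punctuation marks.
    start = html.rfind(">", 0, pos[lo]) + 1
    end = html.find("<", pos[hi])
    if end == -1:
        end = len(html)

    # Step 3: drop the tags inside the boundaries and tidy the lines.
    text = re.sub(r"<[^>]+>", "\n", html[start:end])
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)
```

Calling `extract_content(page_source)` on a page whose navigation and footer contain little punctuation should return only the body text. The sketch ignores `<script>` and `<style>` blocks, and a suitable `max_gap` would have to be tuned per corpus.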
Keywords :
Internet; hypermedia markup languages; information resources; information retrieval; natural language processing; Chinese punctuation distribution; Chinese web pages; HTML page; Internet; content extraction; information resources; left boundary computation; right boundary computation; Accuracy; Data mining; Feature extraction; HTML; Kernel; Navigation; Web pages; content extraction; kernel punctuation; punctuation distribution
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science & Service System (CSSS), 2012 International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-0721-5
Type :
conf
DOI :
10.1109/CSSS.2012.341
Filename :
6394579