DocumentCode :
3113185
Title :
Basic semantic units based web page content extraction
Author :
Wang, Jingqi ; Chen, Qingcai ; Wang, Xiaolong ; Guo, Hongzhi
Author_Institution :
Shenzhen Grad. Sch., Intell. Comput. Res. Center, Harbin Inst. of Technol., Harbin
fYear :
2008
fDate :
12-15 Oct. 2008
Firstpage :
1489
Lastpage :
1494
Abstract :
Web page content extraction can be achieved by node-based and segmentation-based algorithms, both built on top of the document object model (DOM). However, node-based algorithms often remove content embedded as anchor text, while segmentation-based algorithms cannot distinguish irrelevant text from content text when both fall into the same segment. Neither kind of algorithm preserves the paragraph information of the original page. In this paper, a new basic semantic unit (BSU), with granularity between that of a DOM tree node and that of a content block, is defined. Two methods based on BSUs, one using clustering and one using heuristic rules, are developed to extract page content. The clustering method achieves the best precision, 96.88%, while the heuristic rules obtain the best F1-value, 95.28%. Compared with the baseline method, which uses text blocks segmented by <table> and <div> as Web page content, the F1-values are improved by 8.92% and 9.42%, respectively.
Keywords :
content management; information retrieval; pattern clustering; semantic Web; text analysis; tree data structures; Web page content extraction; anchor text; clustering method; document object model tree; heuristic rule; node-based algorithm; segmentation-based algorithm; semantic unit; Clustering algorithms; Clustering methods; Data mining; Displays; Explosions; HTML; Size measurement; Sliding mode control; Testing; Web pages; basic semantic unit; content extraction; line break tag; page segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
Conference_Location :
Singapore
ISSN :
1062-922X
Print_ISBN :
978-1-4244-2383-5
Electronic_ISBN :
1062-922X
Type :
conf
DOI :
10.1109/ICSMC.2008.4811496
Filename :
4811496