DocumentCode :
2120002
Title :
Extracting Web News Using Tag Path Patterns
Author :
Gongqing Wu ; Xindong Wu
Author_Institution :
Sch. of Comput. Sci. & Inf. Eng., Hefei Univ. of Technol., Hefei, China
Volume :
1
fYear :
2012
fDate :
4-7 Dec. 2012
Firstpage :
588
Lastpage :
595
Abstract :
Accurately extracting the content of Web news is a popular and significant issue in Web Intelligence. Many Web news sites have similar structures and layout styles, and there are potential correlations between Web content layouts and tag path patterns. Compared with other extraction features, such as HTML tags, literal words and visual features, a tag path pattern not only addresses content segments well, but also has an advantage in generalization. However, can we accurately extract Web news using only tag path patterns? Motivated by this question, we propose the PPWIE extraction model. We design an extraction algorithm, WEtr, that uses self-defined tag path patterns, and then define a special tag path pattern called the distinguishing tag path pattern. In addition, to tackle the NPC-hard problem in path pattern mining, we propose a polynomial-time (ln|n|+1)-approximation algorithm, MPM, in which n indicates the scale of positive samples. Our experiments show that our integration method, WEtr+MPM in PPWIE, can achieve better performance, with precision, recall and F-score all above 98% on real-world datasets.
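The abstract does not describe how MPM works internally. The (ln n + 1) factor it quotes does match the well-known guarantee of the greedy algorithm for set cover, so the following is only a minimal sketch of that classic greedy procedure, under the assumption that pattern mining can be framed as choosing a small set of tag path patterns that together match every positive sample. The function name `greedy_cover`, the sample ids, and the example tag paths are all hypothetical and do not come from the paper.

```python
def greedy_cover(positives, candidates):
    """Greedy set cover (illustrative; not the paper's MPM algorithm).

    positives  -- set of ids of positive samples that should be matched
    candidates -- dict mapping a pattern (hypothetical tag path string)
                  to the set of sample ids that the pattern matches
    Returns (chosen_patterns, still_uncovered_ids).
    """
    uncovered = set(positives)
    chosen = []
    while uncovered:
        # Pick the pattern that matches the most still-uncovered samples.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            # No remaining pattern matches anything new; stop early.
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy data: five positive samples and four candidate patterns.
example_candidates = {
    "html/body/div/p": {1, 2, 3},
    "html/body/div/div/p": {3, 4},
    "html/body/article/p": {4, 5},
    "html/body/span": {5},
}
example_chosen, example_left = greedy_cover({1, 2, 3, 4, 5}, example_candidates)
print(example_chosen, example_left)
```

Each round takes the pattern with the largest marginal gain, which is what yields the logarithmic approximation bound for set cover. Whether MPM uses this exact selection rule is not stated in the abstract.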
Keywords :
Web sites; approximation theory; computational complexity; data mining; information retrieval; F-score value; NPC-hard problem; PPWIE extraction model; WEtr extraction algorithm design; WEtr+MPM integration method; Web Intelligence; Web content layouts; Web news content extraction features; Web news sites; content segments; distinguishing tag path pattern; path pattern mining; polynomial-time approximation algorithm; precision value; recall value; self-defined tag path patterns; Distinguishing Tag Path Pattern; Pattern Mining; Web Information Extraction; Web News;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location :
Macau
Print_ISBN :
978-1-4673-6057-9
Type :
conf
DOI :
10.1109/WI-IAT.2012.107
Filename :
6511946