DocumentCode
3437900
Title
Blog extraction with template-independent wrapper
Author
Zhang, Zhixuan ; Zhang, Chuang ; Lin, Zhiqing ; Xiao, Bo
Author_Institution
Pattern Recognition & Intell. Syst. Lab. (PRIS), Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2010
fDate
24-26 Sept. 2010
Firstpage
313
Lastpage
317
Abstract
With the development of the blogosphere, millions of users around the world contribute rich information to blogs. However, little work has been done on blog extraction so far. Unlike the traditional template-dependent wrapper, the template-independent wrapper in this paper extracts not only blog articles but also the blogroll. In our method, blog extraction is formalized as a machine learning problem, and a template-independent wrapper is learned from labeled blog pages from a single site. Test pages are obtained from 10 popular Chinese blog sites. Experimental results on 300 real blog pages indicate that the proposed method can correctly extract data from blogs with an accuracy of 90% or higher.
Keywords
Web sites; data mining; learning (artificial intelligence); Chinese blog sites; blog extraction; blogosphere; labeled blog pages; machine learning; template-independent wrapper; Feature extraction; Information services; Internet; Testing; Visualization; Web pages; data extraction; template-independent; web mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-6851-5
Type
conf
DOI
10.1109/ICNIDC.2010.5657967
Filename
5657967