Blog extraction with template-independent wrapper

Author

Zhang, Zhixuan ; Zhang, Chuang ; Lin, Zhiqing ; Xiao, Bo

Author_Institution

Pattern Recognition & Intell. Syst. Lab. (PRIS), Beijing Univ. of Posts & Telecommun., Beijing, China

fYear

2010

fDate

24-26 Sept. 2010

Firstpage

313

Lastpage

317

Abstract

With the growth of the blogosphere, millions of users around the world contribute rich information to blogs. However, little work has been done on blog extraction so far. Unlike traditional template-dependent wrappers, the template-independent wrapper presented in this paper extracts not only blog articles but also the blogroll. In our method, blog extraction is formalized as a machine learning problem, and a template-independent wrapper is learned from labeled blog pages from a single site. Test pages are obtained from 10 popular Chinese blog sites. Experimental results on 300 real blog pages indicate that the proposed method extracts data from blogs with an accuracy of 90% or higher.

Keywords

Web sites; data mining; learning (artificial intelligence); Chinese blog sites; blog extraction; blogosphere; labeled blog pages; machine learning; template-independent wrapper; Feature extraction; Information services; Internet; Testing; Visualization; Web pages; data extraction; template-independent; web mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-6851-5

Type

conf

DOI

10.1109/ICNIDC.2010.5657967

Filename

5657967