• DocumentCode
    3437900
  • Title

    Blog extraction with template-independent wrapper

  • Author

    Zhang, Zhixuan ; Zhang, Chuang ; Lin, Zhiqing ; Xiao, Bo

  • Author_Institution
    Pattern Recognition & Intell. Syst. Lab.(PRIS), Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2010
  • fDate
    24-26 Sept. 2010
  • Firstpage
    313
  • Lastpage
    317
  • Abstract
    With the development of the blogosphere, millions of users around the world contribute rich information to blogs. However, little work has been done on blog extraction so far. Unlike traditional template-dependent wrappers, the template-independent wrapper in this paper extracts not only blog articles but also the blogroll. In our method, blog extraction is formalized as a machine learning problem, and a template-independent wrapper is learned from labeled blog pages from a single site. Test pages are obtained from 10 popular Chinese blog sites. Experimental results on 300 real blog pages indicate that the proposed method extracts data from blogs with an accuracy of 90% or higher.
  • Keywords
    Web sites; data mining; learning (artificial intelligence); Chinese blog sites; blog extraction; blogsphere; labeled blog pages; machine learning; template-independent wrapper; Feature extraction; Information services; Internet; Testing; Visualization; Web pages; data extraction; template-independent; web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6851-5
  • Type

    conf

  • DOI
    10.1109/ICNIDC.2010.5657967
  • Filename
    5657967