• DocumentCode
    3125257
  • Title
    Social Streams Blog Crawler
  • Author
    Hurst, Matthew; Maykov, Alexey
  • Author_Institution
    One Microsoft, Redmond, WA
  • fYear
    2009
  • fDate
    March 29 2009-April 2 2009
  • Firstpage
    1615
  • Lastpage
    1618
  • Abstract
    Weblogs, and other forms of social media, differ from traditional Web content in many ways. One of the most important differences is the highly temporal nature of the content. To be effective, applications that leverage social media content must have access to this data with minimal publication/acquisition latency. An effective Weblog crawler should satisfy the following requirements: low latency, high scalability, high data quality, and appropriate network politeness. In this paper, we outline the Weblog crawler implemented in the Social Streams project and summarize the challenges faced during development.
  • Keywords
    Web sites; search engines; Weblog; blog crawler; social media content; social streams project; Crawlers; Data engineering; Delay; Discussion forums; Feeds; HTML; Information services; Internet; blogs; crawling; social media; web; weblogs
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type
    conf
  • DOI
    10.1109/ICDE.2009.146
  • Filename
    4812583