• DocumentCode
    3127493
  • Title
    A Large Scale URL Verification Pipeline Using Hadoop
  • Author
    Guo, Songtao; Dong, Jianxiong
  • Author_Institution
    AT&T Interactive, San Francisco, CA, USA
  • fYear
    2011
  • fDate
    11 Dec. 2011
  • Firstpage
    159
  • Lastpage
    166
  • Abstract
    Data quality is a key element of local search and advertising. Inaccurate, out-of-date, or missing information makes for an unpleasant search experience and hurts the competitiveness of service providers. This paper addresses the problem of evaluating link quality for business listings in the local search and online advertising domains. We introduce a novel system that applies data mining techniques on a Hadoop-based platform to provide an efficient and highly scalable solution. For various reasons, the links associated with business listings do not always point to the businesses' own websites. Sources of noise include parked domains, broken links, third-party advertisers, and irrelevant websites. To detect this noise and improve link quality, we formulate the task as a binary classification problem: deciding whether a given URL is the business website of the associated listing. Experiments on real-world data show that our system can verify millions of business listings against about 100 million web pages in a couple of hours with 93% classification accuracy.
  • Keywords
    Web sites; advertising; business data processing; data mining; distributed processing; information retrieval; pattern classification; Hadoop; binary classification problem; broken links; business Websites; business listings; data mining technologies; data quality; irrelevant Websites; large scale URL verification pipeline; local search; online advertising domain; parked domain; service providers; third party advertisers; Business; Cities and towns; Data mining; Facsimile; Feature extraction; Internet; Web pages; classification; cloud computing; web information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW)
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    978-1-4673-0005-6
  • Type
    conf
  • DOI
    10.1109/ICDMW.2011.13
  • Filename
    6137375