Title :
A Large Scale URL Verification Pipeline Using Hadoop
Author :
Guo, Songtao ; Dong, Jianxiong
Author_Institution :
AT&T Interactive, San Francisco, CA, USA
Abstract :
Data quality is a key element of local search and advertising. Inaccurate, out-of-date, or missing information gives users an unpleasant search experience and hurts the competitiveness of service providers. This paper addresses the problem of evaluating link quality for business listings in the local search and online advertising domains. We introduce a novel system that applies data mining technologies on a Hadoop-based platform to provide an efficient and highly scalable solution to this problem. For various reasons, the links associated with business listings do not always point to the corresponding business websites. Sources of noise include parked domains, broken links, third-party advertisers, and irrelevant websites. To detect such noise and improve link quality, we formulate the task as a binary classification problem: is a given URL the business website of its associated listing? Experiments on real-world data show that our system can verify millions of business listings against about 100 million web pages in a couple of hours with 93% classification accuracy.
Keywords :
Web sites; advertising; business data processing; data mining; distributed processing; information retrieval; pattern classification; Hadoop; binary classification problem; broken links; business Websites; business listings; data mining technologies; data quality; irrelevant Websites; large scale URL verification pipeline; local search; online advertising domain; parked domain; service providers; third party advertisers; Business; Cities and towns; Facsimile; Feature extraction; Internet; Web pages; classification; cloud computing; web information extraction
Conference_Title :
Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
Conference_Location :
Vancouver, BC
Print_ISBN :
978-1-4673-0005-6
DOI :
10.1109/ICDMW.2011.13