DocumentCode :
2555976
Title :
Banquet speaker
Author :
Rastogi, Rajeev
Author_Institution :
Yahoo! Labs, Bangalore, India
fYear :
2011
fDate :
4-8 Jan. 2011
Firstpage :
1
Lastpage :
2
Abstract :
The web is a vast repository of human knowledge. Extracting structured data from web pages can enable applications like comparison shopping and lead to improved ranking and rendering of search results. In this talk, I will describe two efforts at Yahoo! Labs to extract records from pages at web scale. The first is a wrapper induction system that handles end-to-end extraction tasks, from clustering web pages, to learning XPath extraction rules, to relearning rules when sites change. The system has been deployed in production within Yahoo! to extract more than 200 million records from ∼200 web sites. The second effort exploits content redundancy on the web to automatically extract records without human supervision. Starting with a seed database, we determine values in the pages of each new site that match attribute values in the seed records. We devise a new notion of similarity for matching templatized attribute content, and an Apriori-style algorithm that exploits templatized page structure to prune spurious attribute matches.
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Communication Systems and Networks (COMSNETS), 2011 Third International Conference on
Conference_Location :
Bangalore
Print_ISBN :
978-1-4244-8952-7
Electronic_ISBN :
978-1-4244-8951-0
Type :
conf
DOI :
10.1109/COMSNETS.2011.5716387
Filename :
5716387