Abstract :
The web is a vast repository of human knowledge. Extracting structured data from web pages can enable applications like comparison shopping, and lead to improved ranking and rendering of search results. In this talk, I will describe two efforts at Yahoo! Labs to extract records from pages at web scale. The first is a wrapper induction system that handles end-to-end extraction tasks from clustering web pages to learning XPath extraction rules to relearning rules when sites change. The system has been deployed in production within Yahoo! to extract more than 200 million records from ∼200 web sites. The second effort exploits content redundancy on the web to automatically extract records without human supervision. Starting with a seed database, we determine values in the pages of each new site that match attribute values in the seed records. We devise a new notion of similarity for matching templatized attribute content, and an apriori style algorithm that exploits templatized page structure to prune spurious attribute matches.