DocumentCode :
2888437
Title :
Endless and Scalable Knowledge Table Extraction from Semi-structured Websites
Author :
Yingqin Gu ; Lei Ji ; Ziheng Jiang ; Jun He
Author_Institution :
Key Labs. of Data Eng. & Knowledge Eng., Renmin Univ. of China, Beijing, China
fYear :
2012
fDate :
10-10 Dec. 2012
Firstpage :
835
Lastpage :
842
Abstract :
The problem of scalable knowledge extraction from the Web has attracted much attention in the past decade. However, how to extract structured knowledge from semi-structured Websites in a fully automatic and scalable way remains underexplored. In this work, we define table-formatted structured data with a clear schema as Knowledge Tables and propose a scalable learning system, named Kable, that extracts knowledge from semi-structured Websites automatically in a never-ending and scalable way. Kable consists of two major components: auto wrapper induction and schema matching. In contrast to state-of-the-art auto wrappers for semi-structured Websites, our adopted approach runs around 1,000 times faster, which makes Web-scale knowledge extraction possible. In addition, we propose a novel schema matching solution that works effectively on the auto-extracted structured data. Over 3 months of continuous running on ten Web servers, we successfully extracted 427,105,009 knowledge facts. Manual labeling of sampled extracted knowledge shows a precision of up to 87%, which can support various Web applications.
Keywords :
Internet; Web sites; data mining; file servers; information retrieval; learning (artificial intelligence); Kable; Web applications; Web servers; auto wrapper induction; auto-extracted structured data; automatic semistructured Website structured knowledge extraction; endless knowledge table extraction; scalable knowledge table extraction; scalable learning system; schema matching solution; table-formatted structured data; Algorithm design and analysis; Clustering algorithms; Data mining; Knowledge based systems; Knowledge engineering; Manganese; Motion pictures; information extraction system; knowledge table; schema matching;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
Conference_Location :
Brussels
Print_ISBN :
978-1-4673-5164-5
Type :
conf
DOI :
10.1109/ICDMW.2012.115
Filename :
6406526