DocumentCode :
3717250
Title :
A pipeline for extracting and deduplicating domain-specific knowledge bases
Author :
Mayank Kejriwal;Qiaoling Liu;Ferosh Jacob;Faizan Javed
Author_Institution :
The University of Texas at Austin
fYear :
2015
Firstpage :
1144
Lastpage :
1153
Abstract :
Building a knowledge base (KB) describing domain-specific entities is an important problem in industry, with examples including KBs built over companies (e.g. Dun & Bradstreet), skills (LinkedIn, CareerBuilder) and people (inome). The task involves several engineering challenges, including devising effective procedures for data extraction, aggregation and deduplication. Data extraction involves processing multiple information sources in order to extract domain-specific data instances. The extracted instances must then be aggregated and deduplicated; that is, instances referring to the same underlying entity must be identified and merged. This paper describes a pipeline developed at CareerBuilder LLC for building a KB describing employers, by first extracting entities from both global, publicly available data sources (Wikipedia and Freebase) and a proprietary source (Infogroup), and then deduplicating the instances to yield an employer-specific KB. We conduct a range of pilot experiments over three independently labeled datasets sampled from the extracted KB, and comment on some lessons learned.
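The abstract's aggregation-and-deduplication step (identifying and merging instances that refer to the same underlying entity) can be illustrated with a minimal sketch. This is not the paper's actual method; it assumes a crude name-normalization heuristic as the matching key and invented instance fields (`name`, `source`) purely for demonstration.

```python
import re
from collections import defaultdict

def normalize(name):
    """Crude employer-name key: lowercase, strip punctuation and a few
    common corporate suffixes (illustrative heuristic only)."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    for suffix in (" llc", " inc", " corp"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
    return key.strip()

def deduplicate(instances):
    """Group extracted instances whose names normalize to the same key,
    then merge each group into a single entity record."""
    groups = defaultdict(list)
    for inst in instances:
        groups[normalize(inst["name"])].append(inst)
    merged = []
    for members in groups.values():
        merged.append({
            "name": members[0]["name"],
            "sources": sorted({m["source"] for m in members}),
        })
    return merged

# Hypothetical instances extracted from the three sources named in the paper.
instances = [
    {"name": "CareerBuilder, LLC", "source": "Wikipedia"},
    {"name": "CareerBuilder LLC", "source": "Infogroup"},
    {"name": "Dun & Bradstreet", "source": "Freebase"},
]
entities = deduplicate(instances)  # the two CareerBuilder variants merge
```

Real entity-resolution systems replace the `normalize` key with blocking plus learned or rule-based pairwise matching, but the group-then-merge shape of the pipeline is the same.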
Keywords :
"Data mining","Knowledge based systems","Companies","Encyclopedias","Electronic publishing","Internet"
Publisher :
IEEE
Conference_Titel :
2015 IEEE International Conference on Big Data (Big Data)
Type :
conf
DOI :
10.1109/BigData.2015.7363868
Filename :
7363868