DocumentCode :
3717434
Title :
Robust and distributed web-scale near-dup document conflation in Microsoft Academic Service
Author :
Chieh-Han Wu;Yang Song
Author_Institution :
Microsoft Research, One Microsoft Way, Redmond, WA, USA
fYear :
2015
Firstpage :
2606
Lastpage :
2611
Abstract :
In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a gradient boosted decision tree (GBDT) model to determine whether two papers from different sources should be conflated into the same paper entity. In the online component, we propose a scalable shingling algorithm that can apply our offline model to over 100 million instances. The results show that our algorithm can conflate noisy data robustly and efficiently.
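As a rough illustration of the shingling idea mentioned in the abstract, the following Python sketch builds word-level shingles from paper titles and uses a shared-shingle index to propose candidate pairs, so that only those pairs need to be scored. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names (`shingles`, `jaccard`, `candidate_pairs`), the shingle size `k=3`, the use of titles as the only field, and the Jaccard score (standing in for the trained GBDT model) are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split string."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short strings yield a single shingle (or none if the string is empty).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; here a stand-in for a learned pairwise model."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, k=3):
    """Return pairs of record ids whose titles share at least one shingle.

    `records` maps a (hypothetical) record id to its title string.
    """
    index = defaultdict(set)
    for rid, title in records.items():
        for s in shingles(title, k):
            index[s].add(rid)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Indexing by shingle avoids comparing every record with every other record, which is what makes this kind of blocking attractive at large scale; in a distributed setting each shingle key could be handled by a separate worker.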
Keywords :
"Algorithm design and analysis","Noise measurement","Resource management","Data models","Robustness","Computational modeling","Proteins"
Publisher :
IEEE
Conference_Titel :
2015 IEEE International Conference on Big Data (Big Data)
Type :
conf
DOI :
10.1109/BigData.2015.7364059
Filename :
7364059