Title :
Robust and distributed web-scale near-dup document conflation in Microsoft Academic Service
Author :
Chieh-Han Wu;Yang Song
Author_Institution :
Microsoft Research, Redmond One Microsoft Way, Redmond, WA, USA
Abstract :
In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a gradient boosted decision tree (GBDT) model to determine whether two papers from different sources should be conflated into the same paper entity. In the online component, we propose a scalable shingling algorithm that can apply our offline model to over 100 million instances. The results show that our algorithm can conflate noisy data robustly and efficiently.
Keywords :
"Algorithm design and analysis","Noise measurement","Resource management","Data models","Robustness","Computational modeling","Proteins"
Conference_Title :
Big Data (Big Data), 2015 IEEE International Conference on
DOI :
10.1109/BigData.2015.7364059