Title :
Duplicate Discovery on 2 Billion Internet Images
Author :
Xin-Jing Wang ; Lei Zhang ; Ce Liu
Author_Institution :
Microsoft Res. Asia, Beijing, China
Abstract :
Duplicate image discovery, or discovering duplicate image clusters, is a challenging problem for billions of Internet images due to the lack of a good distance metric that both covers the large variation within a duplicate image cluster and eliminates false alarms. After carefully investigating existing local and global features that have been widely used for large-scale image search and indexing, we propose a two-step approach that combines both local and global features: global descriptors are used to discover seed clusters with high precision, whereas local descriptors are used to grow the seeds to achieve good recall. Using efficient hashing techniques for both features and the MapReduce framework, our system is able to discover about 553.8 million duplicate images from 2 billion Internet images within 13 hours on a 2,000-core cluster.
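The two-step idea in the abstract can be illustrated with a toy, single-machine sketch. This is not the authors' implementation: the abstract does not specify the descriptors, the hash functions, or how seeds are grown, so everything below is an assumption made for illustration. In particular, the function names `seed_clusters` and `grow_clusters`, the `min_shared` threshold, the use of exact hash equality for the global step, and the choice to grow seeds by merging those that share local hashes are all hypothetical. The pairwise comparison is quadratic in the number of seeds and would be replaced by hashing and MapReduce-style grouping at the scale described.

```python
from collections import defaultdict


def seed_clusters(global_hashes):
    """Step 1 (assumed): group images whose global-descriptor hash is identical."""
    buckets = defaultdict(list)
    for image_id, g_hash in global_hashes.items():
        buckets[g_hash].append(image_id)
    # Only buckets holding at least two images count as duplicate seeds.
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]


def grow_clusters(seeds, local_hashes, min_shared=2):
    """Step 2 (assumed): merge seeds that share enough local-descriptor hashes."""
    parent = list(range(len(seeds)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # A seed's signature is the union of its members' local hashes.
    signatures = [set().union(*(local_hashes[i] for i in seed)) for seed in seeds]
    for a in range(len(seeds)):
        for b in range(a + 1, len(seeds)):
            if len(signatures[a] & signatures[b]) >= min_shared:
                parent[find(b)] = find(a)

    merged = defaultdict(list)
    for idx, seed in enumerate(seeds):
        merged[find(idx)].extend(seed)
    return sorted(sorted(ids) for ids in merged.values())


# Made-up toy data: image id -> global hash, image id -> set of local hashes.
global_hashes = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
local_hashes = {
    "a": {10, 11, 12},
    "b": {10, 11},
    "c": {11, 12, 13},
    "d": {13},
    "e": {99},
}
seeds = seed_clusters(global_hashes)
clusters = grow_clusters(seeds, local_hashes)
```

In this toy run the strict global step yields two seeds, and the looser local step merges them because their signatures share two local hashes; image "e" has no partner and stays out of every cluster.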
Keywords :
Internet; file organisation; image retrieval; indexing; parallel algorithms; parallel programming; search engines; Internet images; MapReduce framework; distance metric; duplicate image cluster discovery; duplicate image discovery; false alarm elimination; global descriptors; hashing techniques; image search engines; large-scale image indexing; large-scale image search; local descriptors; Clustering algorithms; Databases; Feature extraction; Merging; Semantics; Vectors
Conference_Title :
Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on
Conference_Location :
Portland, OR
DOI :
10.1109/CVPRW.2013.71