Duplicate Discovery on 2 Billion Internet Images

Author

Xin-Jing Wang ; Lei Zhang ; Ce Liu

Author_Institution

Microsoft Res. Asia, Beijing, China

fYear

2013

fDate

23-28 June 2013

Firstpage

429

Lastpage

436

Abstract

Duplicate image discovery, or discovering duplicate image clusters, is a challenging problem for billions of Internet images due to the lack of a good distance metric that both covers the large variation within a duplicate image cluster and eliminates false alarms. After carefully investigating existing local and global features that have been widely used for large-scale image search and indexing, we propose a two-step approach that combines both local and global features: global descriptors are used to discover seed clusters with high precision, whereas local descriptors are used to grow the seeds to achieve good recall. Using efficient hashing techniques for both features and the MapReduce framework, our system is able to discover about 553.8 million duplicate images from 2 billion Internet images within 13 hours on a 2,000-core cluster.

Keywords

Internet; file organisation; image retrieval; indexing; parallel algorithms; parallel programming; search engines; Internet images; MapReduce framework; distance metric; duplicate image cluster discovery; duplicate image discovery; false alarm elimination; global descriptors; hashing techniques; image search engines; large-scale image indexing; large-scale image search; local descriptors; Clustering algorithms; Databases; Feature extraction; Merging; Semantics; Vectors

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on

Conference_Location

Portland, OR

Type

conf

DOI

10.1109/CVPRW.2013.71

Filename

6595910