DocumentCode :
245013
Title :
Mp-Dissimilarity: A Data Dependent Dissimilarity Measure
Author :
Aryal, Sunil ; Kai Ming Ting ; Haffari, Gholamreza ; Washio, Takashi
Author_Institution :
Clayton Sch. of Inf. Technol., Monash Univ., Melbourne, VIC, Australia
fYear :
2014
fDate :
14-17 Dec. 2014
Firstpage :
707
Lastpage :
712
Abstract :
Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task-specific performance than the lp-norm and cosine distance in classification and information retrieval tasks.
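The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the probability mass in each dimension is estimated as the fraction of data points whose value falls in the closed interval spanned by the two instances, and that the per-dimension masses are combined with a power mean of order `p`. The abstract states neither detail, so the function name `mp_dissimilarity`, its parameters, and the aggregation rule are all illustrative assumptions.

```python
import numpy as np

def mp_dissimilarity(x, y, data, p=2.0):
    """Illustrative data dependent dissimilarity between x and y.

    Assumption: the probability mass in dimension i is estimated as the
    fraction of rows of `data` whose i-th value lies in the closed
    interval [min(x_i, y_i), max(x_i, y_i)].
    Assumption: the per-dimension masses are combined with a power mean
    of order p (the abstract does not specify the aggregation).
    """
    data = np.asarray(data, dtype=float)   # shape (n_points, n_dims)
    x = np.asarray(x, dtype=float)         # shape (n_dims,)
    y = np.asarray(y, dtype=float)         # shape (n_dims,)

    lo = np.minimum(x, y)                  # lower edge of the region per dimension
    hi = np.maximum(x, y)                  # upper edge of the region per dimension

    # Boolean matrix: True where a data point falls inside the region
    # in that dimension; the column mean is the empirical mass.
    inside = (data >= lo) & (data <= hi)
    mass = inside.mean(axis=0)

    # Power mean of the per-dimension masses.
    return float(np.mean(mass ** p) ** (1.0 / p))


# One-dimensional toy data: a dense cluster near 0 and a sparse region near 5.
toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [6.0]])

dense_pair = mp_dissimilarity([0.0], [0.5], toy)    # pair inside the dense cluster
sparse_pair = mp_dissimilarity([5.0], [5.5], toy)   # pair in the sparse region

# Both pairs are 0.5 apart geometrically, yet the sparse pair scores as more similar.
print(dense_pair > sparse_pair)
```

In this toy setting both pairs have the same geometric distance, but the pair in the sparse region encloses less of the data and therefore receives the smaller dissimilarity, which matches the behaviour the abstract describes.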
Keywords :
data mining; probability; query processing; search problems; cosine distance; data dependent dissimilarity measure; data distribution; data mining algorithms; geometric distance; geometric model; high dimensional space; information retrieval tasks; lp-norm; mp-dissimilarity measures; nearest neighbour search; probability mass; reliable nearest neighbour search; Accuracy; Approximation methods; Data mining; Educational institutions; Electronic mail; Information retrieval; Vectors; distance measure; lp-norm; mp-dissimilarity;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining (ICDM), 2014 IEEE International Conference on
Conference_Location :
Shenzhen
ISSN :
1550-4786
Print_ISBN :
978-1-4799-4303-6
Type :
conf
DOI :
10.1109/ICDM.2014.33
Filename :
7023388