DocumentCode :
3104609
Title :
An Information Theoretic Approach to Detection of Minority Subsets in Database
Author :
Ando, Shin ; Suzuki, Einoshin
Author_Institution :
Grad. Sch. of Eng., Yokohama Nat. Univ., Yokohama
fYear :
2006
fDate :
18-22 Dec. 2006
Firstpage :
11
Lastpage :
20
Abstract :
Detection of rare and exceptional occurrences in large-scale databases has become an important practice in the field of knowledge discovery and information retrieval. Many databases include large amounts of noise or irrelevant data, whose distribution often overlaps with the subsets of exceptional data containing useful knowledge. This paper addresses the problem of finding a small subset of "minority" data whose distribution overlaps with, but is exceptional to or inconsistent with, that of the majority of the database. In such a case, conventional distance-based or density-based approaches in Outlier Detection are ineffective due to their dependence on the structure of the majority or on prerequisite critical parameters. We formalize the task as an estimation of a model of the minority subset which provides a simple description of the subset and yet maintains divergence from that of the majority. This estimation is formalized as a minimization problem using an information-theoretic framework of Rate Distortion theory. We further introduce conditions of the majority to derive an objective function which factorizes the property of the minority and the dependence on the structure of the majority. The proposed method shows improvements over conventional approaches on artificial data and a promising result on a document retrieval problem.
Keywords :
data mining; information retrieval; rate distortion theory; very large databases; information theory; knowledge discovery; large-scale databases; minority subsets; outlier detection; Data engineering; Data mining; Databases; Gene expression; Information retrieval; Information science; Knowledge engineering; Rate distortion theory; Rate-distortion; Unsupervised learning
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location :
Hong Kong
ISSN :
1550-4786
Print_ISBN :
0-7695-2701-7
Type :
conf
DOI :
10.1109/ICDM.2006.19
Filename :
4053030