DocumentCode :
2405008
Title :
Fast mining of massive tabular data via approximate distance computations
Author :
Cormode, Graham ; Indyk, Piotr ; Koudas, Nick ; Muthukrishnan, S.
Author_Institution :
Dept of Comput. Sci., Univ. of Warwick, UK
fYear :
2002
fDate :
2002
Firstpage :
605
Lastpage :
614
Abstract :
Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. We present methods for determining similar regions in massive tabular data. Our methods are for computing the "distance" between any two subregions of tabular data: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general since these distance computations can be applied to any mining or similarity algorithms that use Lp norms. A novelty of our distance computation procedures is that they work for any Lp norms, not only the traditional p = 2 or p = 1, but for all p ⩽ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T's data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.
Keywords :
data mining; pattern clustering; relational databases; very large databases; data stores; distance computation procedures; experimental study; large databases; massive tabular datasets; table size; tabular data mining; Application software; Base stations; Cellular phones; Clustering algorithms; Computer science; Data mining; Database systems; Internet; Relational databases; Telecommunication traffic
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering, 2002. Proceedings. 18th International Conference on
Conference_Location :
San Jose, CA
ISSN :
1063-6382
Print_ISBN :
0-7695-1531-2
Type :
conf
DOI :
10.1109/ICDE.2002.994778
Filename :
994778