DocumentCode :
2763057
Title :
A clustered indexing method for optimizing the query for biological database
Author :
Kannan, S. Thabasu ; Iyakutti, K.
Author_Institution :
Dept of Comput. Sci., Sourashtra Coll., Madurai, India
fYear :
2009
fDate :
17-19 March 2009
Firstpage :
1
Lastpage :
6
Abstract :
Genome datasets have grown exponentially in the past few years. With this rapid growth, genome datasets and their associated access structures have become too large to fit in the main memory of a computer, which leads to a large number of disk accesses and slow query response times. Proper tools are therefore needed to access and mine the data efficiently; otherwise the data will be wasted, search times will increase, and efficiency will suffer. This paper describes a new architecture for the approximate matching of unstructured string data using clustering and indexes. The first component is the projected clustering algorithm HARP, chosen because it supports the clustering of high-dimensional data well. Some existing algorithms depend on critical user parameters to determine the relevant attributes of each cluster. If wrong parameter values are used, clustering performance is seriously degraded, and the correct values are rarely known for real datasets. HARP, in contrast, responds to the clustering status and adjusts its internal thresholds dynamically. The second component of the model is a new metric index, called M+Tree, which is used for very large datasets because it incorporates the key dimension feature, which effectively reduces the response time of similarity searches. The main idea is to enlarge the fan-out of the tree by partitioning a subspace further into two subspaces, called twin-nodes. By utilizing the twin-nodes, the filtering effectiveness can be doubled. In addition, to ensure high space utilization, data are reallocated dynamically between the twin nodes. The new method has been tested with both simulated and real expression data. The results show that it is able to uncover interesting patterns effectively. Based on these patterns, overlapping clusters can be discovered, and the expression levels at which each cluster of genes co-expresses under different conditions can be better understood.
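The twin-node filtering idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' M+Tree implementation: it assumes Euclidean distance, a single pair of twin nodes, and a fixed split on one key dimension. The names `TwinNode`, `split_value`, and `range_search` are hypothetical and introduced only for this illustration. The sketch relies on the fact that the Euclidean distance between two points is at least their difference along any single coordinate, so a twin whose key-dimension range lies more than the search radius away from the query can be skipped without computing any distances.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TwinNode:
    """Hypothetical pair of twin nodes splitting a subspace on one key dimension."""
    key_dim: int          # index of the key dimension (assumed chosen elsewhere)
    split_value: float    # boundary between the two twins along key_dim
    left: list = field(default_factory=list)   # points with p[key_dim] <= split_value
    right: list = field(default_factory=list)  # points with p[key_dim] > split_value

    def insert(self, point):
        # Route the point to one twin according to its key-dimension value.
        if point[self.key_dim] <= self.split_value:
            self.left.append(point)
        else:
            self.right.append(point)


def range_search(node, query, radius):
    """Return (matches, twins_visited) for a Euclidean range query."""
    hits = []
    visited = 0
    gap = query[node.key_dim] - node.split_value
    # Left twin holds p[key] <= split; it can contain a match only if
    # the query is at most `radius` beyond the boundary.
    if gap <= radius:
        visited += 1
        hits += [p for p in node.left if math.dist(p, query) <= radius]
    # Right twin holds p[key] > split; it is skipped when the query lies
    # at least `radius` below the boundary.
    if gap > -radius:
        visited += 1
        hits += [p for p in node.right if math.dist(p, query) <= radius]
    return hits, visited


node = TwinNode(key_dim=0, split_value=5.0)
for p in [(1.0, 1.0), (4.0, 4.0), (6.0, 6.0), (9.0, 9.0)]:
    node.insert(p)

near_hits, near_visited = range_search(node, (1.5, 1.5), 1.0)
wide_hits, wide_visited = range_search(node, (5.0, 5.0), 2.0)
```

In this sketch the first query lies far from the boundary, so only one twin is scanned, while the second query straddles the boundary and both twins must be examined. The dynamic reallocation of data between twins described in the abstract is not modeled here.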
Keywords :
bioinformatics; data mining; data structures; database indexing; genomics; pattern clustering; query processing; very large databases; HARP; M+Tree; access structures; biological database query; clustered indexing method; genome datasets; projected clustering algorithm; search time; twin nodes; unstructured string data; very large dataset; Arrays; Clustering algorithms; Neurons; Partitioning algorithms; Power measurement; Prediction algorithms; Topology; Clustering; Indexing; Overlapping clusters; Query; Searching
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
GCC Conference & Exhibition, 2009 5th IEEE
Conference_Location :
Kuwait City
Print_ISBN :
978-1-4244-3885-3
Type :
conf
DOI :
10.1109/IEEEGCC.2009.5734246
Filename :
5734246