• DocumentCode
    2513953
  • Title
    Application of Kernel Functions for Accurate Similarity Search in Large Chemical Databases
  • Author
    Wang, Xiaohong ; Huan, Jun ; Smalter, Aaron ; Lushington, Gerald H.
  • Author_Institution
    Sch. of Electr. Eng. & Comput. Sci., Univ. of Kansas, Lawrence, KS, USA
  • fYear
    2009
  • fDate
    1-4 Nov. 2009
  • Firstpage
    356
  • Lastpage
    361
  • Abstract
    Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing similarity search in large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we use a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time than state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep.
  • Keywords
    biochemistry; biology computing; chemistry computing; computational complexity; database indexing; drugs; genomics; graph theory; pattern classification; query processing; scientific information systems; search problems; very large databases; G-hash method; accurate similarity search; chemical genomics; chemical probe screening; drug design; graph kernel function; hash table; k-nearest neighbor classification; kernel-based similarity measurement; large chemical structure database; predictive model; Bioinformatics; Chemical compounds; Computational complexity; Databases; Drugs; Genomics; Indexing; Kernel; Predictive models; Probes; KNNs search; chemical classification; chemicals; graph kernels
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine, 2009. BIBM '09. IEEE International Conference on
  • Conference_Location
    Washington, DC
  • Print_ISBN
    978-0-7695-3885-3
  • Type
    conf
  • DOI
    10.1109/BIBM.2009.72
  • Filename
    5341762