OWA operators for gene product similarity, clustering, and knowledge discovery

Author

Keller, James M. ; Bezdek, James C. ; Popescu, Mihail ; Pal, Nikhil ; Mitchell, Joyce A. ; Huband, Jacalyn

Author_Institution

Electr. & Comput. Eng. Dept., Univ. of Missouri-Columbia, Columbia, MO, USA

fYear

2005

fDate

26-28 June 2005

Firstpage

233

Lastpage

234

Abstract

The Human Genome Project and its related research bring the promise of a revolution in our knowledge of the function of genes and their interactions with environmental factors across multiple species. However, a huge amount of information related to these genes is spread across multiple knowledge sources available on the Internet, and much of the data changes rapidly. High-density microarray technologies and newer proteomics techniques allow for the analysis of thousands of genes. In clustering and subsequent knowledge discovery on unknown gene products, the primary features to date are the gene sequence and the expression values found following a microarray experiment. One major goal is to determine the function of a gene product and its similarity in function or structure to other up-regulated or down-regulated gene products. Many measures have been proposed to calculate the closeness of sequences. However, for many gene products, additional information comes from the set of Gene Ontology (GO) annotations and the set of journal abstracts related to the gene product. For these genes, it is reasonable to include similarity measures based on the terms found in the GO and/or the index term sets of the related documents (MeSH annotations). In both cases, the task is to compare two sets of terms drawn from a taxonomy (GO or MeSH). In this talk we propose ordered weighted average (OWA) measures for computing the similarity of two sets of terms found in a taxonomy (and hence of the two gene products annotated with terms from the taxonomy). The operators we identify are also known as linear combinations of order statistics (LOS), are special cases of the Choquet integral, and qualify as OWA operators because the weights are defined by linguistic quantifiers. Using them, we build similarity measures based on the collection of pairwise term "associations" found using an information-theoretic approach.
The advantage of the OWA operators is that they are built simply from pairwise coefficients of association, and the measures for the integral fusion can be tailored to produce "linguistic" combinations, e.g., "at least two terms must support the connection". We present examples and comparisons and show the use of our GO approach for knowledge discovery, annotation verification, visualization, and clustering.
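
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, assuming the standard OWA definition (sort the inputs in descending order, then take a weighted sum) and the usual construction of weights from a regular increasing monotone quantifier Q, w_i = Q(i/n) - Q((i-1)/n). The function names and the pairwise association values are hypothetical illustrations, not the authors' implementation or data.

```python
from typing import Callable, List, Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights apply to values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def quantifier_weights(n: int, q: Callable[[float], float]) -> List[float]:
    """Weights from a quantifier Q on [0, 1]: w_i = Q(i/n) - Q((i-1)/n)."""
    return [q(i / n) - q((i - 1) / n) for i in range(1, n + 1)]


def at_least_k_weights(k: int, n: int) -> List[float]:
    """'At least k' quantifier: all weight on the k-th largest input."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [1.0 if i == k else 0.0 for i in range(1, n + 1)]


# Hypothetical pairwise term associations between two annotation sets.
associations = [0.9, 0.2, 0.6, 0.1]
n = len(associations)

# "At least two terms must support the connection": the second-largest value.
sim_at_least_two = owa(associations, at_least_k_weights(2, n))

# Two familiar special cases: the maximum and the arithmetic mean.
sim_max = owa(associations, at_least_k_weights(1, n))
sim_mean = owa(associations, [1.0 / n] * n)

# A smooth quantifier (assumed here: Q(r) = r**2) shifts weight to smaller values.
sim_most = owa(associations, quantifier_weights(n, lambda r: r ** 2))
```

With the "at least two" weights, a single strong term pair is not enough to produce a high similarity: the score is the second-largest association, so two pairs must agree.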

Keywords

biology computing; data mining; data visualisation; genetics; information theory; ontologies (artificial intelligence); Choquet integral; gene clustering; gene ontology; gene product annotation; gene product similarity measure; information theory; knowledge discovery; linear linguistic combination; linguistic quantifier; order statistics; ordered weighted average measure; similarity computing; Abstracts; Bioinformatics; Environmental factors; Genomics; Humans; Internet; Ontologies; Open wireless architecture; Proteomics; Taxonomy;

fLanguage

English

Publisher

IEEE

Conference_Titel

NAFIPS 2005: 2005 Annual Meeting of the North American Fuzzy Information Processing Society

Print_ISBN

0-7803-9187-X

Type

conf

DOI

10.1109/NAFIPS.2005.1548539

Filename

1548539

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2641895