Title :
Inferring correlation between database queries: analysis of protein sequence patterns
Author :
Guigó, Roderic ; Smith, Temple F.
Author_Institution :
Dept. of Biostat., Harvard Univ., Cambridge, MA, USA
fDate :
10/1/1993
Abstract :
Given a subset P of a database, the problem of finding the query φ in a given database attribute having the closest extension to P is addressed. In the particular case that is outlined, P is the set of protein sequences in a protein sequence database matching a given protein sequence pattern, whereas φ is a query in the annotation of the database. Ideally, φ is the description of a biological function. If the extension of φ is very similar to P, an association between the pattern and the biological function described by the query may be inferred. An algorithm that efficiently searches the query space when negation is not considered is developed. Since the query language is a first-order language, the query space may be mapped into a set algebra in which a measure of stochastic dependence (an asymptotic approximation of the correlation coefficient) is used as a measure of set similarity. The algorithm uses the algebraic properties of such a measure to reduce the time required to search the query space. A prototype implementation of the algorithm has been tested on different collections of protein sequence patterns.
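The set-similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or their asymptotic measure: it uses the standard phi coefficient (the Pearson correlation of two set-membership indicators) and an exhaustive scan over single-keyword queries, with none of the algebraic pruning the abstract mentions. The names `phi_coefficient` and `best_query`, and the toy annotation data, are hypothetical.

```python
from math import sqrt


def phi_coefficient(p, q, n):
    """Phi coefficient between sets p and q drawn from a universe of n entries.

    Assumption: this is the textbook correlation of the two membership
    indicators, used as a stand-in for the measure described in the abstract.
    """
    both = len(p & q)
    size_p, size_q = len(p), len(q)
    denom = sqrt(size_p * (n - size_p) * size_q * (n - size_q))
    if denom == 0:
        # One of the sets is empty or covers the whole database.
        return 0.0
    return (n * both - size_p * size_q) / denom


def best_query(pattern_hits, extensions, n):
    """Return (keyword, score) for the keyword whose extension best matches pattern_hits.

    `extensions` maps each annotation keyword to the set of entry ids it selects.
    This is a brute-force scan over single keywords only.
    """
    scored = ((kw, phi_coefficient(pattern_hits, ext, n)) for kw, ext in extensions.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical toy database of 10 entries, identified by the integers 0..9.
N_ENTRIES = 10
PATTERN_HITS = {0, 1, 2, 3}  # entries matching some sequence pattern (P)
EXTENSIONS = {
    "kinase": {0, 1, 2, 3},
    "membrane": {2, 3, 4, 5, 6},
    "zinc finger": {7, 8},
}

if __name__ == "__main__":
    keyword, score = best_query(PATTERN_HITS, EXTENSIONS, N_ENTRIES)
    print(keyword, round(score, 3))
```

In this toy data the extension of "kinase" coincides with P, so it receives the maximum score, which is the situation in which the abstract says an association between pattern and function may be inferred.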
Keywords :
algebra; biology computing; database theory; proteins; query processing; set theory; annotation query; asymptotic approximation; correlation coefficient; correlation inference; database queries; first-order language; protein sequence database; protein sequence pattern analysis; query language; query space; set algebra; set similarity measure; stochastic dependence measurement; Biological information theory; Biomedical measurements; Cancer; Data analysis; Databases; Helium; Pattern analysis; Protein sequence; Sequences; Stochastic processes;
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on