Regional Information Center for Science and Technology - Protein annotators' assistant: A novel application of information retrieval techniques

Abstract:

The Protein Annotators' Assistant (PAA) (http://www.ebi.ac.uk/paa/) is a software system that assists protein annotators in assigning functions to newly sequenced proteins. Working backward from SwissProt, a database that describes known proteins, and from a prior sequence similarity search that returns a list of known proteins similar to a query, PAA suggests keywords and phrases that may describe functions performed by the query protein. In a preprocessing step, a database is built from the protein names that appear in SwissProt, and against each protein are listed the keywords and phrases extracted from the corresponding text records. Words that are common either in general English usage or in the biological domain are removed as the phrases are assembled. This process is assisted by a simple stemming algorithm, which extends the list of stop-words (i.e., reject words), together with a list of accept-words. At runtime, the search algorithm, invoked by a user via a Web interface, takes a list of protein names and clusters the named proteins around keywords and phrases shared by members of the list. The assumption is that if these proteins have a particular keyword or phrase in common, and they are related to a query protein, then that keyword or phrase may also describe the query. Overall, PAA employs a number of information retrieval (IR) techniques in a novel setting and is thus related to text categorization in which multiple categories may be suggested, except that here none of the categories is specified in advance.
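
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not PAA's implementation. The stop-word list, the function names (`extract_keywords`, `build_index`, `cluster_by_keyword`), the `min_support` threshold, and the toy records are all assumptions made for illustration. The sketch handles single-word keywords only and omits stemming, accept-words, and phrase assembly.

```python
from collections import defaultdict

# Hypothetical stop-word list; the actual PAA lists are not given in the abstract.
STOP_WORDS = {"the", "of", "and", "in", "a", "protein", "by", "is", "to"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip punctuation, and drop stop-words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w and w not in stop_words}


def build_index(records):
    """Preprocessing: map each protein name to the keywords of its text record."""
    return {name: extract_keywords(text) for name, text in records.items()}


def cluster_by_keyword(index, hit_names, min_support=2):
    """Runtime: group the named proteins around keywords shared by at least
    min_support of them (min_support is an assumed parameter)."""
    clusters = defaultdict(set)
    for name in hit_names:
        for kw in index.get(name, ()):
            clusters[kw].add(name)
    shared = {kw: names for kw, names in clusters.items()
              if len(names) >= min_support}
    # Keywords shared by the most hits come first; ties break alphabetically.
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))


# Toy records, invented for illustration; not real SwissProt entries.
records = {
    "PROT_A": "Catalyzes the phosphorylation of serine residues; kinase.",
    "PROT_B": "Serine kinase involved in signal transduction.",
    "PROT_C": "Membrane transport protein.",
}
index = build_index(records)
suggestions = cluster_by_keyword(index, ["PROT_A", "PROT_B", "PROT_C"])
```

Here `suggestions` holds the keywords shared by at least two of the listed proteins, each paired with the set of proteins that share it. Under the abstract's assumption, those keywords are candidate descriptions of the query protein.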