DocumentCode :
3498058
Title :
Use of figures in literature mining for biomedical digital libraries
Author :
Chen, Nawei ; Shatkay, Hagit ; Blostein, Dorothea
Author_Institution :
Sch. of Comput., Queen's Univ., Kingston, Ont.
fYear :
2006
fDate :
27-28 April 2006
Lastpage :
197
Abstract :
The maintenance of biomedical digital libraries (including organism databases and protein databases) involves analysis of a large number of documents. Much work is done manually: curators study large numbers of biomedical documents while updating and annotating organism databases such as MGI (Mouse Genome Informatics) and FlyBase (a database of the fruit-fly genome). We summarize the annotation process in organism databases, and describe some of the roles played by the Gene Ontology and by document databases such as PubMed. Efforts are ongoing to automate parts of the annotation process. Biomedical text mining contests, such as the TREC Genomics Track (Hersh et al., 2004, 2005), define annotation subtasks, and provide training and test data. So far, these efforts have focused on the analysis of the text content of documents. We are investigating the analysis of figures in biomedical documents; the information derived from figure analysis may later be combined with the information derived from text analysis. We present an algorithm for using figures in document triage; triage involves determining which documents are relevant to a given annotation task. In our triage algorithm, we segment figures into subfigures and classify the subfigures as graphical, gel, fluorescence microscopy, or other microscopy. A secondary classification into subcategories is performed by clustering, using clusters created from the subfigures in the labeled training data. The classifications of all subfigures in a document are combined to form a document descriptor. The document descriptor is then classified using a naive Bayes classifier, as either relevant or irrelevant to the given annotation task.
Keywords :
Bayes methods; data mining; digital libraries; medical computing; pattern classification; pattern clustering; text analysis; Flybase; biomedical digital library; biomedical documents; biomedical text mining; document analysis; document annotation; document classification; document clustering; document database; document descriptor; document triage; gene ontology; library maintenance; literature mining; mouse genome informatics; naive Bayes classifier; organism databases; protein databases; text analysis; Bioinformatics; Clustering algorithms; Data analysis; Databases; Genomics; Information analysis; Microscopy; Organisms; Proteins; Software libraries;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Image Analysis for Libraries, 2006. DIAL '06. Second International Conference on
Conference_Location :
Lyon
Print_ISBN :
0-7695-2531-8
Type :
conf
DOI :
10.1109/DIAL.2006.45
Filename :
1612961