DocumentCode :
1989407
Title :
Refining the extraction of relevant documents from biomedical literature to create a corpus for pathway text mining
Author :
Harte, Rachel ; Lu, Yan ; Osborn, Stephen ; Dehoney, David ; Chin, Daniel
Author_Institution :
PPD Discovery, Inc., Menlo Park, CA, USA
fYear :
2003
fDate :
11-14 Aug. 2003
Firstpage :
644
Lastpage :
645
Abstract :
For biologists to keep up with developments in their own and related fields, automation is desirable for reading and interpreting a rapidly growing literature more efficiently. Identification of proteins or genes and their interactions can facilitate the mapping of canonical or evolving pathways from the literature. In order to mine such data, we developed procedures and tools to pre-qualify documents for further analysis. Initially, a corpus of documents for proteins of interest was built using alternate symbols from LocusLink and the Stanford SOURCE as MEDLINE search terms. The query was refined using the optimum keywords together with MeSH terms combined in a Boolean query to minimize false positives. The document space was examined using a strategy employing latent semantic indexing (LSI), which uses Entrez's "related papers" utility for MEDLINE. Relationships between documents were visualized using an undirected graph and scored by their relatedness. Distinct document clusters, formed by the most highly connected related papers, are mostly composed of abstracts relating to one aspect of research. This feature was used to filter irrelevant abstracts, which resulted in a reduction in corpus size of 10% to 30% depending on the domain. The excluded documents were examined to confirm their lack of relevance. Corpora consisted of the most relevant documents, thus reducing the number of false positives and irrelevant examples in the training set for pathway mapping. Documents were tagged, using a modified version of GATE2, with terms based on GO for rule induction using RAPIER.
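The graph-based filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring or clustering rule, so the sketch assumes that "clusters" are connected components of the undirected related-papers graph and that small components are treated as irrelevant. The function names, the `min_cluster_size` threshold, and the document identifiers are all hypothetical.

```python
from collections import defaultdict


def build_graph(doc_ids, related_pairs):
    """Build an undirected graph from (doc, related_doc) pairs.

    Assumption: pairs come from some "related papers" lookup; links to
    documents outside the corpus are ignored.
    """
    corpus = set(doc_ids)
    graph = defaultdict(set)
    for a, b in related_pairs:
        if a in corpus and b in corpus and a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(doc_ids, graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    components = []
    for start in doc_ids:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def filter_corpus(doc_ids, related_pairs, min_cluster_size=3):
    """Keep only documents that fall in sufficiently large clusters.

    Assumption: cluster size stands in for the relatedness score
    mentioned in the abstract, which is not specified there.
    """
    graph = build_graph(doc_ids, related_pairs)
    kept = set()
    for component in connected_components(doc_ids, graph):
        if len(component) >= min_cluster_size:
            kept |= component
    return kept
```

With five hypothetical documents where `d1`, `d2`, and `d3` are chained by related-paper links and `d4`, `d5` are linked only to each other, `filter_corpus` with the default threshold keeps the first three and drops the pair. Excluded documents would still need manual review, as the abstract notes.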
Keywords :
biology computing; data mining; genetics; molecular biophysics; proteins; Boolean query; Entrez related paper; GATE2; LocusLink; MEDLINE search terms; MeSH terms; RAPIER; Stanford SOURCE; automation; biologists; biomedical literature; canonical mapping; corpora; corpus creation; data mining; document clusters; documents relationship visualization; genes identification; genetic ontology; latent semantic indexing; optimum keywords; pathway mapping; pathway text mining; pre-qualify documents; proteins identification; relevant documents extraction; rule induction; symbols usage; undirected graph; Abstracts; Automation; Data mining; Filters; Indexing; Large scale integration; Proteins; Text analysis; Text mining; Visualization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE
Print_ISBN :
0-7695-2000-6
Type :
conf
DOI :
10.1109/CSB.2003.1227432
Filename :
1227432