SPOT the Drug! An Unsupervised Pattern Matching Method to Extract Drug Names from Very Large Clinical Corpora

Author

Coden, Anni ; Gruhl, Daniel ; Lewis, Neal ; Tanenblatt, Michael ; Terdiman, Joe

Author_Institution

T.J. Watson Res. Center, IBM, Hawthorne, NY, USA

fYear

2012

fDate

27-28 Sept. 2012

Firstpage

Lastpage

Abstract

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. “Understanding” these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is critical not only for understanding patient health (e.g., for drug-drug or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, the FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but keeping them up-to-date in the face of a rapidly advancing field is even more challenging. For example, it is critical to identify grapefruit as a “drug” for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating algorithms that are not only accurate but also fast at identifying pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names in a large corpus that can be used as new dictionary entries, where a “drug” is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. We present precision and recall values for SPOT, measured against a manually annotated reference corpus. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date, and can also suggest words and phrases which may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text.
SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary and automatically extracting the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.
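
The abstract does not give formulas for support, confidence, or prevalence, so the following is only an illustrative sketch of one seeded context-matching pass, not the authors' implementation. The function name spot_candidates, the fixed token window, and all three feature definitions (support as the number of seed matches in a context, confidence as the fraction of a context's occurrences filled by a seed, prevalence as the context's share of all token slots) are assumptions made for this example.

```python
from collections import Counter


def spot_candidates(tokens, seeds, window=1,
                    min_support=2, min_confidence=0.5, min_prevalence=0.0):
    """One hypothetical pass of seeded context matching over a token list.

    All feature definitions below are assumptions for illustration only.
    """
    total = Counter()   # occurrences of each (left, right) context
    seeded = Counter()  # occurrences where the slot holds a seed term
    slots = []          # (context, filler token) for every position

    for i in range(window, len(tokens) - window):
        ctx = (tuple(tokens[i - window:i]),
               tuple(tokens[i + 1:i + 1 + window]))
        slots.append((ctx, tokens[i]))
        total[ctx] += 1
        if tokens[i] in seeds:
            seeded[ctx] += 1

    # Keep only contexts that pass all three (assumed) thresholds.
    accepted = set()
    for ctx, support in seeded.items():
        confidence = support / total[ctx]
        prevalence = total[ctx] / len(slots)
        if (support >= min_support and confidence >= min_confidence
                and prevalence >= min_prevalence):
            accepted.add(ctx)

    # Non-seed tokens that fill an accepted context become candidates.
    return {tok for ctx, tok in slots if ctx in accepted and tok not in seeds}


text = ("patient takes aspirin daily . patient takes lipitor daily . "
        "patient takes grapefruit daily . patient feels fine today").split()
candidates = spot_candidates(text, seeds={"aspirin", "lipitor"})
```

In this toy corpus the context "takes _ daily" is learned from the two seed drugs, so "grapefruit" is proposed as a candidate. Raising min_confidence to 0.9 rejects that context and yields no candidates, which illustrates the kind of threshold tradeoff the abstract mentions. A bootstrapping variant would add accepted candidates to the seed set and repeat the pass.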

Keywords

dictionaries; diseases; drugs; health care; information retrieval; medical information systems; natural language processing; patient diagnosis; patient treatment; pattern matching; statistical analysis; text analysis; FDA Orange Book; NCI; NLP; RxNorm; SPOT algorithm; clinical corpora; clinical medical text analysis; disease cure; disease diagnosis; disease mitigation; disease prevention; drug dictionaries; drug name extraction; drug-drug interactions; drug-enzyme interaction; generalized bootstrapping algorithm; natural language processing; patient health information; patient history; patient treatment; pharmacological substances; prescription drugs; prescription medicine Lipitor; structured electronic health records; uncatalogued synonym; unstructured text; unsupervised pattern matching method; Accuracy; Context; Dictionaries; Drugs; Semantics; Syntactics; Training; Dictionary expansion; context matching; drug interactions; drugs; healthcare; medications; pattern discovery; pattern matching; unsupervised learning;

fLanguage

English

Publisher

IEEE

Conference_Titel

Healthcare Informatics, Imaging and Systems Biology (HISB), 2012 IEEE Second International Conference on

Conference_Location

San Diego, CA

Print_ISBN

978-1-4673-4803-4

Type

conf

DOI

10.1109/HISB.2012.16

Filename

6366185

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=579462