SPOT the drug! An unsupervised pattern matching method to extract drug names from very large clinical corpora

Anni Coden; Daniel Gruhl; Neal Lewis; Michael Tanenblatt; Joe Terdiman

doi:10.1109/HISB.2012.16

ICHI 2012

Conference paper

01 Dec 2012

Abstract

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. "Understanding" these texts has been a focus of natural language processing research for many years, with some remarkable successes. Knowing which drugs patients take is critical not only for understanding patient health (e.g., for drug-drug or drug-enzyme interactions) but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm or FDA's Orange Book, with a focus on prescription drugs. Developing these dictionaries is a challenge, but keeping them up to date in a rapidly advancing field is even more challenging. To discover new adverse drug interactions, a large number of patient histories often needs to be examined, which requires algorithms that identify pharmacological substances both accurately and quickly. We propose a new algorithm, SPOT, which identifies drug names in a large corpus that can be used as new dictionary entries, where a "drug" is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. We report precision and recall for SPOT, measured against a manually annotated gold-standard corpus. SPOT is language- and syntax-independent, can be run efficiently to keep dictionaries up to date, and can also suggest words and phrases that may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text. SPOT is a generalized bootstrapping algorithm: it is seeded with a known dictionary and automatically extracts the context in which each drug is mentioned. We define three features of such context: support, confidence and prevalence. We present the performance tradeoffs that result from the thresholds chosen for these features.
© 2012 IEEE.
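
The bootstrapping idea in the abstract can be illustrated with a minimal sketch of one round. It is not the authors' implementation. The abstract names support, confidence and prevalence but does not define them, so the definitions below are assumptions made only for illustration. The context shape (one word on each side), the function name spot_sketch, the threshold defaults and the sample sentences are also assumptions.

```python
from collections import defaultdict


def spot_sketch(sentences, seeds, min_support=2, min_confidence=0.5,
                min_prevalence=1):
    """One bootstrapping round. All feature definitions here are assumed."""
    seeds = {s.lower() for s in seeds}

    # Map each (left word, right word) context to the words seen in its slot.
    # Whitespace splitting keeps the sketch free of language-specific tools.
    fillers = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            fillers[(tokens[i - 1], tokens[i + 1])].append(tokens[i])

    accepted_contexts = defaultdict(int)
    for context, words in fillers.items():
        known = [w for w in words if w in seeds]
        # Assumed "support": number of distinct seed terms seen in this context.
        support = len(set(known))
        # Assumed "confidence": share of the context's fillers that are seeds.
        confidence = len(known) / len(words)
        if support >= min_support and confidence >= min_confidence:
            for word in set(words) - seeds:
                # Assumed "prevalence": number of accepted contexts per candidate.
                accepted_contexts[word] += 1

    return {w for w, n in accepted_contexts.items() if n >= min_prevalence}


# Hypothetical toy corpus and seed dictionary.
sentences = [
    "patient takes aspirin daily",
    "patient takes ibuprofen daily",
    "patient takes metoprolol daily",
    "patient denies pain today",
]
found = spot_sketch(sentences, {"aspirin", "ibuprofen"})
strict = spot_sketch(sentences, {"aspirin", "ibuprofen"}, min_confidence=0.9)
```

In this toy corpus the context "takes _ daily" is shared by two seed terms, so the unknown word in the same slot becomes a candidate. Raising min_confidence rejects that context, which mirrors the threshold tradeoff the abstract mentions. A fuller version would presumably add accepted candidates to the seed set and repeat.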
