Biomedical corpus filtering: A weak supervision paradigm with infused domain expertise
Querying biomedical documents from large databases such as PubMed is traditionally keyword-based and usually returns large volumes of documents that lack specificity. Further filtering with natural language processing (NLP) techniques is commonly bottlenecked by the need for a large amount of labeled data to train a machine learning model. To overcome this limitation, we are constructing an NLP pipeline that automatically labels relevant published abstracts, without fitting to any hand-labeled training data, with the goal of identifying the most promising non-cancer generic drugs to repurpose for the treatment of cancer. This work aims to programmatically classify a large set of research articles as either relevant or non-relevant, where a relevant article is one that evaluates the efficacy of a non-cancer generic drug in a cancer patient population. We use Snorkel, a Python-based weak supervision modeling library, which allows domain expertise to be infused into heuristic rules. With a robust set of rules, promising classification accuracy can be achieved cheaply on a large set of documents, making this approach readily applicable to other domains.
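To make the idea of heuristic rules concrete, the following is a minimal sketch in plain Python of how several rules can each vote on an abstract or abstain, with the votes then combined into a single weak label. It does not use Snorkel's API and is not the pipeline described here: the keyword lists, rule names, and example abstracts are hypothetical, and a simple majority vote stands in for the learned label model that Snorkel provides.

```python
from collections import Counter

# Label values: each rule returns one of these for a given abstract.
RELEVANT, NOT_RELEVANT, ABSTAIN = 1, 0, -1

# Hypothetical keyword lists; a real rule set would come from domain experts.
CANCER_TERMS = ("cancer", "carcinoma", "tumor", "tumour")
PATIENT_TERMS = ("patients", "clinical trial", "cohort")
PRECLINICAL_TERMS = ("in vitro", "cell line", "mice", "mouse model")


def contains_any(text, terms):
    """Return True if any term occurs in the lower-cased text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def lf_no_cancer_mention(abstract):
    # An abstract that never mentions cancer is assumed to be non-relevant.
    return ABSTAIN if contains_any(abstract, CANCER_TERMS) else NOT_RELEVANT


def lf_cancer_patient_population(abstract):
    # Cancer terms together with patient terms suggest a study in patients.
    if contains_any(abstract, CANCER_TERMS) and contains_any(abstract, PATIENT_TERMS):
        return RELEVANT
    return ABSTAIN


def lf_preclinical_study(abstract):
    # Preclinical work does not evaluate efficacy in a patient population.
    return NOT_RELEVANT if contains_any(abstract, PRECLINICAL_TERMS) else ABSTAIN


LABELING_FUNCTIONS = (
    lf_no_cancer_mention,
    lf_cancer_patient_population,
    lf_preclinical_study,
)


def weak_label(abstract):
    """Combine rule votes by majority; abstain on ties or when no rule fires."""
    votes = [lf(abstract) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


if __name__ == "__main__":
    examples = [
        "Metformin use and survival in breast cancer patients: a retrospective cohort study.",
        "Aspirin reduces proliferation in a colorectal cancer cell line in vitro.",
        "Statin therapy and cardiovascular outcomes in elderly patients.",
    ]
    for text in examples:
        print(weak_label(text), text)
```

Letting rules abstain is the key design choice in this sketch: each rule only needs to be right on the narrow slice of abstracts it recognizes, and coverage comes from adding more rules rather than from hand-labeling documents.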