Publication
SDU 2021
Conference paper
Biomedical corpus filtering: A weak supervision paradigm with infused domain expertise
Abstract
Querying biomedical documents from large databases such as PubMed is traditionally keyword-based and usually returns large volumes of documents that lack specificity. A common bottleneck in further filtering with natural language processing (NLP) techniques is the need for a large amount of labeled data to train a machine learning model. To overcome this limitation, we are constructing an NLP pipeline that automatically labels relevant published abstracts, without fitting to any hand-labeled training data, with the goal of identifying the most promising non-cancer generic drugs to repurpose for the treatment of cancer. This work aims to programmatically classify a large set of research articles as either relevant or non-relevant, where a relevant article is a study that has evaluated the efficacy of non-cancer generic drugs in cancer patient populations. We use Snorkel, a Python-based weak supervision modeling library, which allows domain expertise to be infused into heuristic rules. With a robust set of rules, promising classification accuracy can be achieved cheaply on a large set of documents, making this work easily applicable to other domains.
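The idea of encoding domain expertise as heuristic rules can be illustrated with a minimal sketch. It does not use Snorkel's API and is not the authors' pipeline: the keyword lists, the rule names, and the `weak_label` helper are all hypothetical, and a simple majority vote stands in for the label model that Snorkel would fit over the rule outputs. Each rule votes relevant, non-relevant, or abstains, and the votes are combined into one weak label per abstract.

```python
import re
from collections import Counter

# Label conventions (assumed here): a rule may also abstain.
ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1

# Hypothetical keyword lists, for illustration only.
GENERIC_DRUGS = {"metformin", "aspirin", "propranolol"}
CANCER_TERMS = re.compile(r"\b(cancers?|carcinomas?|tumou?rs?|oncolog\w*)\b", re.I)
PATIENT_TERMS = re.compile(r"\b(patients?|cohort|clinical trial|randomi[sz]ed)\b", re.I)
PRECLINICAL_TERMS = re.compile(r"\b(in vitro|cell lines?|mice|mouse|xenograft)\b", re.I)


def lf_drug_and_cancer(text):
    """Vote relevant when a listed generic drug and a cancer term co-occur."""
    mentions_drug = any(drug in text.lower() for drug in GENERIC_DRUGS)
    return RELEVANT if mentions_drug and CANCER_TERMS.search(text) else ABSTAIN


def lf_patient_population(text):
    """Vote relevant when the text describes a cancer patient population."""
    if PATIENT_TERMS.search(text) and CANCER_TERMS.search(text):
        return RELEVANT
    return ABSTAIN


def lf_preclinical(text):
    """Vote non-relevant for studies that look preclinical."""
    return NOT_RELEVANT if PRECLINICAL_TERMS.search(text) else ABSTAIN


def lf_no_cancer(text):
    """Vote non-relevant when no cancer term appears at all."""
    return ABSTAIN if CANCER_TERMS.search(text) else NOT_RELEVANT


LABELING_FUNCTIONS = [
    lf_drug_and_cancer,
    lf_patient_population,
    lf_preclinical,
    lf_no_cancer,
]


def weak_label(text):
    """Combine rule votes by majority; abstain on ties or when no rule fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(vote for vote in votes if vote != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Calling `weak_label` on each abstract yields a weak label with no hand-labeled training data involved. Abstracts on which the rules tie or stay silent are left unlabeled, which is one reason a learned label model that weighs rules by their estimated accuracy is usually preferred to a plain vote.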