- COLING 2022
Knowledge-enhanced accelerated discovery
The Accelerated Discovery (AD) initiative at IBM leverages recent advances in quantum computing, AI, and hybrid cloud to drastically accelerate the discovery of solutions to today’s most urgent scientific problems. One of the challenging problems in building AI-based discovery solutions is how to combine rich information from different knowledge bases with diverse modalities to learn a compact, computable representation of knowledge, which is then used to enhance new AI-based discovery solutions.
Under the scope of this project, we are developing approaches that enable rapid semantic annotation of vast amounts of scientific articles with minimal human effort. We also aim to ingest high-quality data, i.e., to rank evidence based on its veracity. This automatically annotated and trusted evidence is combined with existing domain-curated knowledge bases hosted by Deep Search to create a giant multi-modal knowledge graph. Our multimodal knowledge representation learning approaches turn this giant graph into a unified, compact representation via self-supervised learning. This pretrained, knowledge-enhanced foundational representation is then used for new downstream machine learning tasks in Accelerated Discovery.
Zero-shot and few-shot learning
We focus on named entity and property extraction from text when labeled data is small or non-existent, i.e., in zero-shot and few-shot settings:
Named entity and property extraction from text data are classical NLP problems. Recent approaches rely on advances in deep learning: language models are pre-trained on open-domain text and then fine-tuned with labeled data for downstream tasks on domain-specific text. However, when extracting knowledge from domain-specific text, it is extremely hard to collect large labeled datasets for training deep learning models in new domains. Therefore, zero-shot and few-shot named entity and property extraction emerge as important research problems.
Zero- and few-shot approaches rely on contextual information, which may be provided as a textual description or as information extracted from a knowledge base, such as properties and neighboring nodes.
A simple example of an entity definition is the APPLE label, accompanied by the textual description:
"An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus."
The description can be used to identify and disambiguate entities in a text, extracting only relevant mentions (excluding, for example, Apple mentions that refer to the company rather than the fruit). The same principle applies to property extraction.
Most approaches project paragraphs and sentences into the multi-dimensional embedding space of a pre-trained model and extract entities and properties there, exploiting the model’s pre-acquired knowledge.
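The idea can be illustrated with a minimal sketch: represent the mention context and each candidate entity description as vectors, then pick the entity whose description is most similar. Here a toy bag-of-words counter stands in for the pre-trained encoder, and the entity labels (`APPLE_FRUIT`, `APPLE_COMPANY`) and descriptions are illustrative, not from any real knowledge base.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a
    pre-trained transformer encoder instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Entity descriptions act as the only "supervision" (zero-shot):
# no labeled mentions of either entity are needed.
descriptions = {
    "APPLE_FRUIT": "an edible fruit produced by an apple tree",
    "APPLE_COMPANY": "a technology company that designs smartphones and computers",
}

def disambiguate(mention_context: str) -> str:
    """Link a mention to the entity whose description best matches its context."""
    ctx = embed(mention_context)
    return max(descriptions, key=lambda e: cosine(ctx, embed(descriptions[e])))

print(disambiguate("she picked a ripe apple from the tree"))        # APPLE_FRUIT
print(disambiguate("the company unveiled a new smartphone today"))  # APPLE_COMPANY
```

With a real encoder the same ranking step generalizes to unseen entity types, since only the descriptions change.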
We have released an open-source framework that implements state-of-the-art algorithms for zero- and few-shot entity extraction, such as BLINK, GENRE, SMXM, and TARS.
Zshot allows entities to be defined with contextual information and extends spaCy’s visualization tools for displaying results.
The toolkit is under active development; we are adding new state-of-the-art approaches, as well as features for zero-shot property extraction and zero-shot graph extraction.
Representation learning provides vector embeddings of entities such as proteins and molecules that efficiently encode information about those entities. As in the language domain, representations of proteins and molecules can be obtained by pretraining language-model-style architectures on sequence databases. The pretrained representation is then used for generative or predictive modeling in drug and material discovery.
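The self-supervised pretraining signal for such sequence models is typically masked-token prediction: hide some residues and train the model to recover them. A minimal sketch of the data-preparation step, with a made-up example sequence (the masking rate and `[MASK]` token follow the common BERT-style convention, not any specific IBM model):

```python
import random

def mask_sequence(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Produce (masked_input, targets) for BERT-style self-supervised
    pretraining on a protein sequence. Each residue is independently
    masked with probability mask_rate; targets record the positions
    and original residues the model must predict."""
    rng = random.Random(seed)  # seeded for reproducibility
    tokens = list(seq)
    targets = []
    for i, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets.append((i, residue))
            tokens[i] = "[MASK]"
    return tokens, targets

# Toy sequence, illustrative only.
tokens, targets = mask_sequence("MKTAYIAKQR", mask_rate=0.3)
```

The model never needs labeled data: the sequence itself supplies both input and supervision, which is what makes pretraining on large sequence databases possible.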
Fusing multi-modal databases such as protein 3D structures (PDB), protein ontologies (Gene Ontology), and other KGs provided by Deep Search to enhance representations learned from sequence alone is an open research problem. We ingest diverse data sources from the knowledge bases hosted on Deep Search and fuse them into a giant, expressive multi-modal knowledge graph, which our representation learning approaches then compress into the unified, knowledge-enhanced representation described above.
The same graph, showing data nodes for each protein and drug (nodes are color-coded by modality: text, number, protein sequence, etc.). The graph has 31,004 nodes and 356,037 edges.
What is the immediate consequence of making all this diverse, interconnected multimodal knowledge consumable? It “democratizes” discovery: we can query our knowledge graphs and predict, for example using Graph Neural Networks, relevant missing links between molecules and proteins that we did not know about, helping us narrow the large space of candidate solutions in a scientific discovery scenario.
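Link prediction over learned embeddings can be illustrated with a simpler model than a GNN: a TransE-style score, which treats a relation as a translation in embedding space and ranks a triple (head, relation, tail) by how close head + relation lands to tail. The entity names, relation, and 3-d embeddings below are entirely made up for illustration; real embeddings would be learned by self-supervised training on the graph.

```python
import math

def transe_score(h, r, t):
    """TransE-style plausibility score: embeddings are trained so that
    head + relation ≈ tail for true triples; higher (less negative)
    score means a more plausible link."""
    dist = math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    return -dist

# Toy 3-d embeddings (hypothetical; in practice learned from the KG).
emb = {
    "aspirin": [0.9, 0.1, 0.0],
    "COX1":    [1.0, 1.1, 0.0],
    "keratin": [0.0, 0.2, 0.9],
}
rel_binds = [0.1, 1.0, 0.0]

# Rank candidate proteins for the unseen link (aspirin, binds, ?).
candidates = ["COX1", "keratin"]
best = max(candidates, key=lambda p: transe_score(emb["aspirin"], rel_binds, emb[p]))
print(best)  # COX1: aspirin + binds lands exactly on COX1 by construction
```

Scoring every candidate tail this way is precisely how missing-link prediction prunes a large candidate space: only the highest-scoring pairs are passed on for experimental validation.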
Accumulated scientific knowledge is the foundation on which informed decision making is built, with huge impact across a wide range of critical applications. In particular, policy makers seek trustworthy scientific evidence to formulate optimal policy across different fields, and journalists search for reliable scientific evidence to debunk misinformation. However, the sheer volume and dynamic nature of scientific papers make it extremely difficult even for domain experts to keep track of the current state of knowledge in each field: the scientific literature is a heterogeneous space, every study has its own limitations, and scholars may disagree with one another as research progresses.
The goal of Evidence Veracity is to help scientists grasp the landscape of a topic and identify gaps, problems, and new opportunities in targeted research areas. Another objective is to identify trustworthy scientific evidence for other downstream applications. This is a long-term goal, with milestones that include exploratory research. In the long term, we plan to develop a domain-agnostic scientific discourse tagger to identify the argumentative units in scientific publications (e.g., claims and evidence). Such a tagger will serve as a building block for aggregating different views and recognizing contradictory findings across papers. We also plan to develop text summarization technologies to present insights to end-users.
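To make the discourse-tagging idea concrete, here is a deliberately crude cue-phrase baseline that labels sentences as claims or evidence. The cue lists are invented for illustration; the planned tagger would be a trained sequence classifier, not a keyword lookup.

```python
# Illustrative cue phrases (hypothetical); a real discourse tagger
# would learn these distinctions from annotated scientific text.
CLAIM_CUES = ("we show", "we demonstrate", "our results suggest", "we argue")
EVIDENCE_CUES = ("table", "figure", "p <", "observed", "measured")

def tag_sentence(sentence: str) -> str:
    """Assign a coarse argumentative role to one sentence."""
    s = sentence.lower()
    if any(cue in s for cue in CLAIM_CUES):
        return "claim"
    if any(cue in s for cue in EVIDENCE_CUES):
        return "evidence"
    return "other"

print(tag_sentence("We show that the compound inhibits growth."))    # claim
print(tag_sentence("As observed in Figure 2, growth drops by 40%."))  # evidence
```

Even this toy version shows how sentence-level roles, once extracted, can be aggregated across papers to surface agreements and contradictions on a claim.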
In sum, we are on a quest to create the technology required to produce easily consumable, curated, trustworthy, and generalizable knowledge that can be used across different predictive and generative downstream tasks… and that is only one of many technologies we are developing to accelerate science and increase the pace of discovery. And we are open-sourcing most of the parts.