Publication
ACS Fall 2022
Invited talk
A system for understanding molecular formula patterns for non-stoichiometric, variable inorganic compounds
Abstract
Researchers in inorganic chemistry, especially those working on batteries, superconductors, or ceramics, often describe non-stoichiometric compounds in publications as patterns: variable formulas with range limits that denote a set of compounds expected to share similar characteristics. Two examples are “SiLixOy wherein 0.05<x<0.7 and 0.9<y<1.1” and “LiNi1−x MnxO2x (where 0<x<1).” These patterns make it difficult to search the literature for specific element ratios, because the ranges described in such formulas do not reduce to a single formula with whole-number subscripts. To begin expanding the discovery of such compounds, IBM created a system for parsing and indexing these non-stoichiometric molecular formulas. The parser finds such patterns and their surrounding text in publications such as patents and articles. A second step breaks the patterns into atomic elements, subscripts, parentheses, lists, and range categories and produces a JSON object model suitable for indexing (e.g., with Solr/Lucene). The resulting index enables searching of molecular formulas and patterns across multiple element composition ranges, and these searches can be combined with fast full-text search. The same indexing technique is also used to efficiently support searching of physical units, polymers, and tables. The system is designed for iterative improvement and is embedded within a larger accelerated discovery environment called CIRCA (Chemical Information Resources for Cognitive Analytics).
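To make the second step more concrete, the following is a minimal Python sketch of how the first example pattern might be split into elements, subscripts, and ranges and then serialized as JSON. It is not IBM's parser: the function names (parse_formula, parse_ranges, parse_pattern), the JSON field names, and the small ELEMENTS set are all hypothetical, and the sketch assumes single-letter variables and simple "low<var<high" conditions. It does not handle subscript expressions such as "1−x", parentheses, or lists.

```python
import json
import re

# Illustrative subset of element symbols (assumption); a real system
# would need the full periodic table.
ELEMENTS = {"Li", "Si", "O", "Ni", "Mn"}

# A range condition such as "0.05<x<0.7" or "0<=x<=1" (hypothetical grammar).
RANGE = re.compile(r"(\d*\.?\d+)\s*(<=?)\s*([a-z])\s*(<=?)\s*(\d*\.?\d+)")
# A subscript is either a number or a single-letter variable.
SUBSCRIPT = re.compile(r"\d*\.?\d+|[a-z]")


def parse_formula(formula):
    """Split a formula such as 'SiLixOy' into element/subscript terms."""
    terms, i = [], 0
    while i < len(formula):
        # Prefer a two-letter symbol, then fall back to a one-letter symbol.
        for width in (2, 1):
            symbol = formula[i:i + width]
            if symbol in ELEMENTS:
                break
        else:
            raise ValueError(f"unrecognized symbol at position {i}")
        i += len(symbol)
        match = SUBSCRIPT.match(formula, i)
        subscript = match.group(0) if match else "1"  # implicit subscript of 1
        i += len(match.group(0)) if match else 0
        terms.append({"element": symbol, "subscript": subscript})
    return terms


def parse_ranges(condition):
    """Extract per-variable bounds from text like '0.05<x<0.7 and 0.9<y<1.1'."""
    ranges = {}
    for low, low_op, var, high_op, high in RANGE.findall(condition):
        ranges[var] = {
            "min": float(low),
            "min_inclusive": low_op == "<=",
            "max": float(high),
            "max_inclusive": high_op == "<=",
        }
    return ranges


def parse_pattern(text):
    """Build a JSON-ready object; the formula is assumed to be the first token."""
    formula, _, condition = text.partition(" ")
    return {
        "formula": formula,
        "terms": parse_formula(formula),
        "ranges": parse_ranges(condition),
    }


doc = parse_pattern("SiLixOy wherein 0.05<x<0.7 and 0.9<y<1.1")
print(json.dumps(doc, indent=2))
```

Storing each variable's bounds as numeric fields is one plausible way to let an index answer range queries over element ratios; the schema actually used by the system is not described in the abstract.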