PatCID: Large-scale chemical-structure database from images in patent documents

Ingmar Meijer; Valery Weber; Lucas Morin; Peter Staar; Junta Fuchiwaki; Masataka Hirose

ACS Fall 2023

Talk

13 Aug 2023


Abstract

Chemical-structure databases that rely on manual curation remain an authoritative data source, but they struggle to scale with the growing volume of (patent) documents. Publicly accessible chemical-structure databases built by automated processing are therefore becoming increasingly popular resources for accelerating the discovery of new molecules.

This presentation discusses the chemical-structure database PatCID (Patent Chemical-structure Image Discovery), which contains more than TBD_number unique molecular structures displayed as 2D molecular structure images in documents published after 1980 by patent offices in the United States (USPTO), Europe, Japan (JPO), Korea, and China. We found that processing JPO documents in addition to USPTO documents was crucial for good coverage of the breadth of the organic chemistry domain: for example, about 70% of JPO patent families in the organic chemistry domain were not extended to the USPTO. The chemical-structure database will be made publicly accessible.

For this chemical-structure database, we developed a new graph-based visual recognition model that converts 2D molecular structure images to a standard machine-readable molecular description. The model comprises a deep keypoint detector and a graph neural network that classifies atoms and bonds. It achieved a substantial precision advantage over the widely used OSRA utility and over alternative deep learning approaches, especially for the low-resolution and unconventional images frequently found in documents from patent offices in the Asia-Pacific region. The model architecture and a trained model will be made publicly available on GitHub.
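As a rough illustration of the last step of such a pipeline, the sketch below assembles per-atom and per-bond predictions into a molecular graph and derives a molecular formula from it. It is not the PatCID model: the detector and classifier outputs are hard-coded, the names `Atom`, `Bond`, `implicit_hydrogens`, and `molecular_formula` are hypothetical, and implicit hydrogens are inferred from a simplified default-valence table.

```python
from collections import Counter
from dataclasses import dataclass

# Simplified default valences, assumed here only to infer implicit hydrogens.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


@dataclass
class Atom:
    """A detected keypoint with its predicted element (hypothetical output format)."""
    x: float
    y: float
    element: str


@dataclass
class Bond:
    """A predicted bond between two atom indices with its bond order."""
    begin: int
    end: int
    order: int


def implicit_hydrogens(atoms, bonds):
    """Return the number of implicit hydrogens for each atom."""
    used = [0] * len(atoms)
    for bond in bonds:
        used[bond.begin] += bond.order
        used[bond.end] += bond.order
    return [
        max(DEFAULT_VALENCE.get(atom.element, 0) - valence_used, 0)
        for atom, valence_used in zip(atoms, used)
    ]


def molecular_formula(atoms, bonds):
    """Return the molecular formula in Hill order."""
    counts = Counter(atom.element for atom in atoms)
    counts["H"] += sum(implicit_hydrogens(atoms, bonds))
    present = [element for element in counts if counts[element] > 0]
    if "C" in present:
        # Hill order with carbon: C first, then H, then the rest alphabetically.
        ordered = ["C"] + (["H"] if "H" in present else [])
        ordered += sorted(e for e in present if e not in ("C", "H"))
    else:
        # Hill order without carbon: all elements alphabetically.
        ordered = sorted(present)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in ordered)


# Toy example: three keypoints and two single bonds (ethanol).
atoms = [Atom(0.0, 0.0, "C"), Atom(1.0, 0.5, "C"), Atom(2.0, 0.0, "O")]
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]
print(molecular_formula(atoms, bonds))  # C2H6O
```

In practice the graph would more likely be handed to a cheminformatics toolkit to produce a standard representation; the formula is used here only because it needs no external dependency.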

A large-scale chemical-structure database can accelerate the discovery of new molecules. To substantiate this, we will present a methodology that helped discover new cyanine dye molecules. Options for each distinct substructure of the cyanine dyes were catalogued automatically, and white space in the patent document landscape was identified. This supported the decision making of chemistry experts, who would otherwise have faced an overwhelming number of documents to digest for knowledge extraction.
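The idea of cataloguing substructure options and locating white space can be sketched as a simple enumeration. This is only an illustration under stated assumptions, not the methodology used for PatCID: it assumes each known molecule has already been decomposed into one label per substructure position, and the position names, labels, and the functions `catalogue_options` and `white_space` are hypothetical.

```python
from itertools import product


def catalogue_options(records):
    """Collect the distinct options observed at each substructure position."""
    options = {}
    for record in records:
        for position, label in record.items():
            options.setdefault(position, set()).add(label)
    return options


def white_space(records):
    """Return combinations of observed options that appear in no record.

    Assumes every record provides a label for every position.
    """
    options = catalogue_options(records)
    positions = sorted(options)
    seen = {tuple(record[p] for p in positions) for record in records}
    candidates = product(*(sorted(options[p]) for p in positions))
    return [dict(zip(positions, combo)) for combo in candidates if combo not in seen]


# Toy records with placeholder labels (not real patent data).
known = [
    {"end_group": "A", "bridge": "x"},
    {"end_group": "B", "bridge": "y"},
]
for candidate in white_space(known):
    print(candidate)
```

With the two toy records above, the unclaimed combinations are bridge `x` with end group `B` and bridge `y` with end group `A`. A real analysis would also need to judge which of the enumerated combinations are chemically sensible, which is where expert review comes in.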
