Autoencoder Based on Graph and Recurrent Neural Networks and Application to Property Prediction
Abstract
Machine learning has been applied to various subjects in materials science, including property prediction and molecular structure generation. Successful machine learning models can lead to discovering promising materials more quickly. However, a challenge remains in automatically acquiring features that effectively represent materials. An autoencoder attempts to learn an effective representation in a so-called latent space and has been applied to learning representations of molecular structures. Given a molecular structure as input, the encoder of the autoencoder maps its input to a vector in the latent space, called a latent vector. The decoder of the autoencoder maps that latent vector back to the original molecular structure. An autoencoder thus has the potential to learn features as latent vectors even without labeled training data. Molecular Hypergraph Grammar Variational Auto-Encoder (MHG-VAE) consists of an encoder and a decoder based on recurrent neural networks combined with a molecular hypergraph grammar. While MHG-VAE additionally guarantees the structural validity of decoded molecules, it suffers from the drawback that new molecular structures cannot always be encoded. MHG-VAE therefore has limited applicability to downstream tasks on new molecules whose features need to be represented as its latent vectors. We introduce a new autoencoder that can always encode any molecular structure. Our autoencoder achieves this advantage by replacing the encoder of MHG-VAE with a graph neural network. It inherits all the other advantages of MHG-VAE, including the structural validity guaranteed by the decoder. We have trained an autoencoder model on a large set of molecules available in the PubChem database. We have also trained a prediction model that takes latent vectors of the autoencoder as input to predict a material property. We show that such a downstream task can involve molecules that MHG-VAE cannot encode. Moreover, our prediction model outperforms MHG-VAE even when the training and test datasets are restricted to molecules that MHG-VAE can encode.
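The abstract describes the architecture only at a high level. The following is a minimal, hypothetical sketch (in PyTorch, not the authors' code) of how the pieces it names could fit together: a graph-neural-network encoder mapping a molecular graph to a latent vector, an RNN decoder emitting a sequence of grammar-rule indices standing in for molecular hypergraph grammar productions, and a property-prediction head on top of the latent vector. All module names (GNNEncoder, GrammarRNNDecoder, PropertyHead), layer sizes, and the toy inputs are assumptions for illustration.

```python
# Hypothetical sketch of the architecture sketched in the abstract; layer sizes,
# names, and inputs are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class GNNEncoder(nn.Module):
    """Message-passing encoder: atom features + adjacency -> latent vector."""

    def __init__(self, node_dim, hidden_dim, latent_dim, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden_dim)
        self.message_layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)
        )
        self.readout = nn.Linear(hidden_dim, latent_dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, node_dim), adj: (num_nodes, num_nodes)
        h = torch.relu(self.embed(node_feats))
        for layer in self.message_layers:
            # Aggregate neighbor states and update node representations.
            h = torch.relu(layer(adj @ h) + h)
        # Sum pooling over nodes yields a graph-level latent vector.
        return self.readout(h.sum(dim=0))


class GrammarRNNDecoder(nn.Module):
    """GRU decoder producing grammar-rule logits from a latent vector."""

    def __init__(self, latent_dim, hidden_dim, num_rules, max_len=30):
        super().__init__()
        self.max_len = max_len
        self.init_hidden = nn.Linear(latent_dim, hidden_dim)
        self.gru = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_rules)

    def forward(self, z):
        # Feed the latent vector at every step, a simple conditioning scheme.
        inputs = z.view(1, 1, -1).repeat(1, self.max_len, 1)
        h0 = torch.tanh(self.init_hidden(z)).view(1, 1, -1)
        states, _ = self.gru(inputs, h0)
        return self.out(states)  # (1, max_len, num_rules) rule logits


class PropertyHead(nn.Module):
    """Downstream regressor predicting a material property from the latent vector."""

    def __init__(self, latent_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, z):
        return self.mlp(z)


# Toy usage: a 5-atom molecule with 16-dimensional atom features.
encoder = GNNEncoder(node_dim=16, hidden_dim=64, latent_dim=32)
decoder = GrammarRNNDecoder(latent_dim=32, hidden_dim=64, num_rules=100)
predictor = PropertyHead(latent_dim=32)

node_feats = torch.randn(5, 16)
adj = torch.eye(5)  # placeholder adjacency; a real molecule would use its bond structure
z = encoder(node_feats, adj)          # latent vector: any molecular graph can be encoded
rule_logits = decoder(z)              # sequence of grammar-rule scores for reconstruction
property_value = predictor(z)         # downstream property prediction from the latent vector
```

In this sketch the encoder accepts an arbitrary graph, which reflects the abstract's claim that any molecular structure can be encoded; the grammar-based decoding and validity guarantees described for MHG-VAE are only represented schematically by the rule-index outputs.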