Molecular Descriptor for Global Relationship of intra-Molecular Substructures
Abstract
Fluorescent organic dyes are widely applied in diverse fields, such as OLEDs, sensors, solar cells, medicine, and drug delivery. Extensive research efforts have been dedicated to developing new dyes with desired photophysical and photochemical properties. These properties depend on intra-molecular interactions that result from global relationships among substructures within the molecule, e.g., the distance between donor and acceptor groups and/or conjugated systems, especially in large molecular structures. Machine learning (ML) has played a significant role in accelerating material discovery, aiming to reduce time and cost and to increase variability. ML models designed for property prediction are trained on features that encapsulate the characteristics of molecules, including molecular descriptors that capture different facets of these molecules. Consequently, the efficiency with which structural features are extracted plays a crucial role. Various molecular descriptors have been developed, ranging from Quantitative Structure-Property Relationship (QSPR) based descriptors, which basically enumerate constituent elements, to neural-network-based descriptors. However, they still have limitations in accurately capturing the global relationships of intra-molecular substructures. Herein, we introduce a new molecular descriptor, Topological Distance of intra-Molecular Substructures (TDiMS), which extracts the topological distance between each pair of substructures within a molecule. The topological distance between a pair of substructures is approximately defined as the overall mean of the shortest bond distances between the atoms constituting each substructure. By averaging over atoms, we aim to capture a distance that accounts for the spread of each substructure and is therefore independent of the shape of particular substructures. Additionally, this calculation method makes it possible to freely target any desired fragment.
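To make the approximate definition above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the molecule is given as a connected graph of heavy atoms (an adjacency list keyed by atom index) and that each substructure is a list of atom indices. The function names `shortest_bond_distances` and `mean_topological_distance` are hypothetical, and the toy chain is illustrative only.

```python
from collections import deque


def shortest_bond_distances(adjacency, source):
    """Breadth-first search: number of bonds from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def mean_topological_distance(adjacency, substructure_a, substructure_b):
    """Mean shortest bond distance over all atom pairs (a, b), a in A and b in B.

    Assumes both substructures lie in the same connected molecular graph.
    """
    total = 0
    count = 0
    for a in substructure_a:
        dist = shortest_bond_distances(adjacency, a)
        for b in substructure_b:
            total += dist[b]
            count += 1
    return total / count


# Toy linear chain of six heavy atoms: 0-1-2-3-4-5 (illustrative only).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
donor = [0, 1]      # hypothetical donor-like fragment
acceptor = [4, 5]   # hypothetical acceptor-like fragment

# Pairwise distances are 4, 5, 3, 4 bonds, so the mean is 4.0.
print(mean_topological_distance(adjacency, donor, acceptor))
```

Because the mean runs over every atom pair, the value reflects the extent of both fragments instead of a single closest contact, which is one plausible reading of capturing the distance "with spread".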
In this study, we targeted heavy atoms, circular substructures derived from the Morgan fingerprint, and fragments related to organic solar cells. The feature vector derived by the proposed TDiMS approach contains values that are directly linked to the topological distance between pairs of substructures. More precisely, this study used either the inverse square or the inverse of the topological distance, considering factors such as Coulomb's law and conjugated systems. Our evaluations reveal that TDiMS outperformed six representative QSPR-based and neural-network-based descriptors in prediction models for several tasks on dye-related datasets. Across all tasks, TDiMS achieved an average enhancement rate of 17% over the other benchmark descriptors. Moreover, further analysis indicates that TDiMS captured the crucial features that contributed significantly to accurate prediction of the target properties. These features collectively offered chemical insights into substructure pairs, emphasizing the importance of topological distance in molecular design. This study also suggests an important direction for neural network development: incorporating information on the topological distance of intra-molecular substructures can lead to further improvement.
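As an illustration of how the inverse and inverse-square values could populate a feature vector, here is a minimal sketch under stated assumptions. It assumes a topological distance has already been computed for each substructure pair. The function name `distance_features`, the pair labels, and the distance values are hypothetical. How zero distances or repeated occurrences of a pair are handled is not specified in the text, so this sketch simply rejects non-positive distances.

```python
def distance_features(pair_distances, power=2):
    """Map each substructure pair to 1 / d**power.

    power=1 gives the inverse distance, power=2 the inverse square
    (the form that appears in Coulomb's law).
    """
    features = {}
    for pair, d in pair_distances.items():
        if d <= 0:
            # Assumption of this sketch: only strictly positive distances are valid.
            raise ValueError(f"non-positive distance for pair {pair}: {d}")
        features[pair] = 1.0 / d ** power
    return features


# Hypothetical substructure pairs with illustrative distances (in bonds).
pair_distances = {("N", "O"): 4.0, ("N", "C=O"): 2.0}

inverse = distance_features(pair_distances, power=1)
inverse_square = distance_features(pair_distances, power=2)

print(inverse)         # {('N', 'O'): 0.25, ('N', 'C=O'): 0.5}
print(inverse_square)  # {('N', 'O'): 0.0625, ('N', 'C=O'): 0.25}
```

Both transforms give nearby pairs larger feature values than distant ones; the inverse square decays faster, so it weights short-range pairs more heavily.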