Abstract:
The rapid growth of molecular data and advances in deep learning have driven significant progress in molecular informatics, an emerging field that integrates chemistry, computational science, and artificial intelligence (AI) and employs data-driven methods to decode the relationships between molecular structures and their properties, thereby supporting drug design and material discovery. Molecular representation learning (MRL), a fundamental aspect of molecular informatics, encodes molecular structures and properties into numerical vectors to provide efficient representations for downstream tasks. High-quality molecular representations are critical for accurate property prediction, optimization, and generation. However, traditional rule-based MRL methods rely on handcrafted features whose design is time-consuming and expert-dependent. Sequence-based MRL methods, such as those built on the simplified molecular input line entry system (SMILES), often place bonded atoms at distant positions in the string, leading to suboptimal representations that fail to fully capture spatial and topological information. In contrast, because molecules naturally form graph structures with atoms as nodes and bonds as edges, graph-based models can operate directly on these molecular graphs. Aided by the strong performance of graph models in representing complex structures, learning cross-scale features, and performing constrained optimization, graph-based MRL methods have achieved significant advances in molecular property prediction and molecular generation. In this review, we first introduce the evolution of molecular representation methods, focusing on 2D and 3D molecular graph representations. We then classify graph models into discriminative and generative categories and discuss their concepts and applications.
Graph-discriminative models encode topological structures and node/edge features to capture nonlinear structure-property relationships for classification and regression tasks. Graph-generative models learn molecular distributions to optimize existing structures or design novel compounds with desired properties. Next, we review commonly used datasets, evaluation metrics, and research progress in molecular property prediction and molecular generation. Molecular property prediction infers physical and chemical properties by analyzing internal molecular information, thereby helping researchers quickly identify suitable candidates among large numbers of potential compounds. We briefly present three categories of property prediction methods (2D graph-based, 3D graph-based, and domain knowledge-integrated approaches) and introduce a recent representative method for each. Furthermore, we review research on various graph neural network models for material property prediction and their corresponding application scenarios. The goal of molecular generation is to learn latent distributions from limited datasets and to generate, through sampling and decoding, novel structures that fulfill specific chemical functions. We introduce widely used frameworks for molecular generation, such as variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, and diffusion models, which have demonstrated strong capabilities in capturing complex molecular features and optimizing chemical properties while preserving chemical validity. In addition, using crystal material generation as an example, we introduce and compare different deep generative models for material discovery, highlighting their specific application scenarios, strengths, and limitations.
Finally, we discuss future research directions for graph models in molecular informatics from the perspectives of large-scale pre-training, explainable AI, and multimodal learning strategies. This review aims to assist molecular informatics researchers in identifying cutting-edge studies and applicable methods, while clarifying the technical pathways for AI researchers to promote more efficient algorithm design and implementation.
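
As a minimal illustration of the contrast drawn above between sequence-based and graph-based representations, the following Python sketch uses cyclohexane. It rests on stated assumptions rather than on any method from this review: the bond list is entered by hand instead of being parsed from the SMILES string, and the helper names build_adjacency and graph_distance are hypothetical, introduced only for this example. It shows that the two carbons joined by the ring-closure bond sit five atoms apart in the string yet are direct neighbors in the molecular graph.

```python
from collections import deque

# Cyclohexane as SMILES: the ring-closure digit "1" bonds the first and last carbon.
smiles = "C1CCCCC1"

# Heavy atoms in the order they appear in the string (hydrogens are implicit).
atoms = ["C"] * 6

# Assumption: bonds are listed by hand for this illustration, not parsed from `smiles`.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]


def build_adjacency(num_atoms, bond_list):
    """Molecular graph as an adjacency list: atoms are nodes, bonds are undirected edges."""
    adjacency = {i: [] for i in range(num_atoms)}
    for a, b in bond_list:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def graph_distance(adjacency, source, target):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # target not reachable from source


adjacency = build_adjacency(len(atoms), bonds)

sequence_gap = abs(5 - 0)                   # separation in the string's atom order
bond_gap = graph_distance(adjacency, 0, 5)  # separation in the molecular graph

print(sequence_gap, bond_gap)  # prints: 5 1
```

A graph model that passes messages along the edges in adjacency treats atoms 0 and 5 as immediate neighbors, whereas a model reading the string left to right must recover that relationship from the ring-closure digits.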